PySpark – split()

PySpark SQL provides the split() function to convert a delimiter-separated String column into an Array column (StringType to ArrayType) on a DataFrame. It splits the string column on a delimiter such as a space, comma, or pipe and returns the pieces as an ArrayType column.

In this article, we will explain how to convert a String column to an Array column using the split() function, both with the DataFrame API and in a SQL query.

split() Function Syntax

PySpark SQL split() is grouped under Array Functions in the pyspark.sql.functions module and has the following syntax.

pyspark.sql.functions.split(str, pattern, limit=-1)

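The split() function takes a DataFrame column of type String as the first argument and the delimiter to split on as the second. The delimiter is interpreted as a Java regular expression, so characters with a special meaning in a regex, such as the pipe (|) or dot (.), must be escaped to match literally. The optional limit argument caps the number of elements in the result; the default of -1 applies no limit. The function returns a pyspark.sql.Column of type Array.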

Before we start with usage, let’s first create a DataFrame with a string column whose text is separated by commas:

from pyspark.sql import SparkSession
spark = SparkSession.builder \
         .appName('mytechmint') \
         .getOrCreate()

data = [("James, A, Smith","2018","M",3000),
            ("Michael, Rose, Jones","2010","M",4000),
            ("Robert,K,Williams","2010","M",4000),
            ("Maria,Anne,Jones","2005","F",4000),
            ("Jen,Mary,Brown","2010","",-1)
            ]

columns=["name","dob_year","gender","salary"]
df=spark.createDataFrame(data,columns)
df.printSchema()

This yields the output below. As you can see, the name column holds the first name, middle name, and last name separated by commas.

#Output
root
 |-- name: string (nullable = true)
 |-- dob_year: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: long (nullable = true)

PySpark Convert String to Array Column

The snippet below splits the String column name on the comma delimiter and converts it to an Array. Because select() returns only the columns you list, the original name column is not carried over. If you add the array with withColumn() instead, use drop() to remove the original column.

from pyspark.sql.functions import split, col
df2 = df.select(split(col("name"),",").alias("NameArray"))
df2.printSchema()
df2.show(truncate=False)

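This yields the output below. As the schema shows, NameArray is an array type. Note that the elements keep any whitespace that followed a comma in the source string, as in the first two rows.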

#Output
root
 |-- NameArray: array (nullable = true)
 |    |-- element: string (containsNull = true)

+------------------------+
|NameArray               |
+------------------------+
|[James,  A,  Smith]     |
|[Michael,  Rose,  Jones]|
|[Robert, K, Williams]   |
|[Maria, Anne, Jones]    |
|[Jen, Mary, Brown]      |
+------------------------+

Convert String to Array Column Using SQL Query

Since PySpark provides a way to execute raw SQL, let’s write the same example as a Spark SQL expression.

To use raw SQL, you first need to create a temporary view using createOrReplaceTempView(). The view is backed by the DataFrame and remains available for the lifetime of the SparkSession that created it.

df.createOrReplaceTempView("PERSON")
spark.sql("select SPLIT(name,',') as NameArray from PERSON") \
    .show(truncate=False)

This yields the same output as the example above.

Complete Example

Below is the complete example of splitting a String type column on a delimiter or pattern and converting it into an ArrayType column.

from pyspark.sql import SparkSession
spark = SparkSession.builder \
         .appName('mytechmint') \
         .getOrCreate()

data = [("James, A, Smith","2018","M",3000),
            ("Michael, Rose, Jones","2010","M",4000),
            ("Robert,K,Williams","2010","M",4000),
            ("Maria,Anne,Jones","2005","F",4000),
            ("Jen,Mary,Brown","2010","",-1)
            ]

columns=["name","dob_year","gender","salary"]
df=spark.createDataFrame(data,columns)
df.printSchema()
df.show(truncate=False)

from pyspark.sql.functions import split, col
df2 = df.select(split(col("name"),",").alias("NameArray"))
df2.printSchema()
df2.show(truncate=False)

df.createOrReplaceTempView("PERSON")
spark.sql("select SPLIT(name,',') as NameArray from PERSON") \
    .show()

Conclusion

In this article, we learned how to convert a string column into an array column by splitting the string on a delimiter, and how to use the split function in a PySpark SQL expression.
