
PySpark – lit()

The PySpark SQL function lit() adds a new column to a DataFrame by assigning a literal or constant value; Spark's Scala API additionally offers typedLit() for the same purpose. Both return a Column.

lit() is available in PySpark by importing pyspark.sql.functions; typedLit() exists only in the Scala/Java API.

First, let’s create a DataFrame.

import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('mytechmint').getOrCreate()
data = [("111",50000),("222",60000),("333",40000)]
columns= ["EmpId","Salary"]
df = spark.createDataFrame(data = data, schema = columns)
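
A quick df.show() confirms the contents of the DataFrame we just created:

df.show()
#Output
+-----+------+
|EmpId|Salary|
+-----+------+
|  111| 50000|
|  222| 60000|
|  333| 40000|
+-----+------+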

lit() Function to Add Constant Column

The PySpark lit() function is used to add a constant or literal value as a new column to the DataFrame.

Per the Spark API docs: lit() creates a Column of literal value. If the object passed in is already a Column, it is returned directly; otherwise, a new Column is created to represent the literal value.
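
As a quick sanity check of that behavior (a minimal sketch using only the standard API; the printed reprs below are from Spark 3.x):

from pyspark.sql.functions import lit, col

print(type(lit(1)))        # <class 'pyspark.sql.column.Column'>
print(lit(1))              # Column<'1'> - a new literal Column
print(lit(col("Salary")))  # Column<'Salary'> - an existing Column is returned as-is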

Let’s take a look at some examples.

Example 1: Simple Usage of lit() Function

Let’s see an example of how to create a new column with a constant value using the lit() SQL function. In the snippet below, we add the literal string ‘1’ as a new column of the PySpark DataFrame.

from pyspark.sql.functions import col,lit
df2 = df.select(col("EmpId"),col("Salary"),lit("1").alias("lit_value1"))
df2.show(truncate=False)
#Output
+-----+------+----------+
|EmpId|Salary|lit_value1|
+-----+------+----------+
|  111| 50000|         1|
|  222| 60000|         1|
|  333| 40000|         1|
+-----+------+----------+
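
Because lit("1") was given a string, lit_value1 is a string column. Passing a Python int instead yields an integer column; a small sketch (the column name lit_value_int is just illustrative):

from pyspark.sql.functions import col, lit
df_int = df.select(col("EmpId"), lit(1).alias("lit_value_int"))
df_int.printSchema()
#Output
#root
# |-- EmpId: string (nullable = true)
# |-- lit_value_int: integer (nullable = false)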

Adding the same constant literal to every record in a DataFrame is rarely useful on its own, so let’s look at another example.


Example 2: lit() Function with withColumn()

The following example shows how to use the PySpark lit() function with withColumn() to derive a new column based on a condition. Note that each comparison must be wrapped in parentheses, because Python’s & operator binds more tightly than >= and <=.

from pyspark.sql.functions import when, lit, col
df3 = df2.withColumn("lit_value2", when((col("Salary") >= 40000) & (col("Salary") <= 50000), lit("100")).otherwise(lit("200")))
df3.show(truncate=False)

Below is the output for the above code snippet.

#Output
+-----+------+----------+----------+
|EmpId|Salary|lit_value1|lit_value2|
+-----+------+----------+----------+
|  111| 50000|         1|       100|
|  222| 60000|         1|       200|
|  333| 40000|         1|       100|
+-----+------+----------+----------+
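
The same inclusive range check can be written more compactly with Column.between(); a sketch of the equivalent expression (df3b is just an illustrative name):

from pyspark.sql.functions import when, lit, col
df3b = df2.withColumn("lit_value2", when(col("Salary").between(40000, 50000), lit("100")).otherwise(lit("200")))
df3b.show(truncate=False)  # produces the same output as above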

typedLit() Function

The difference between lit() and typedLit() is that typedLit() can also handle collection types, e.g. Array, Map (dictionary), etc. Note that typedLit() belongs to the Scala API; pyspark.sql.functions does not expose it.
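
As a PySpark workaround (a sketch, not the typedLit() API itself; the column names are illustrative), collection literals can be built from lit() with array() and create_map():

from pyspark.sql.functions import array, create_map, col, lit
df_coll = df.select(
    col("EmpId"),
    array(lit(1), lit(2), lit(3)).alias("lit_array"),     # constant array column
    create_map(lit("bonus"), lit(1000)).alias("lit_map")  # constant map column
)
df_coll.show(truncate=False)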

Complete Example of How to Add Constant Column

import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('mytechmint').getOrCreate()
data = [("111",50000),("222",60000),("333",40000)]
columns= ["EmpId","Salary"]
df = spark.createDataFrame(data = data, schema = columns)
df.printSchema()
df.show(truncate=False)

from pyspark.sql.functions import col,lit
df2 = df.select(col("EmpId"),col("Salary"),lit("1").alias("lit_value1"))
df2.show(truncate=False)
from pyspark.sql.functions import when
df3 = df2.withColumn("lit_value2", when((col("Salary") >= 40000) & (col("Salary") <= 50000), lit("100")).otherwise(lit("200")))
df3.show(truncate=False)

Conclusion

We have learned multiple ways to add a constant literal value to a DataFrame using the PySpark lit() function, and we have seen how lit() differs from typedLit().

When possible, prefer the predefined PySpark functions: they are optimized by Spark and perform better than user-defined functions (UDFs). If your application is performance-critical, avoid custom UDFs, as their performance is not guaranteed.
