PySpark - fillna() and fill()

In PySpark, DataFrame.fillna() or DataFrameNaFunctions.fill() is used to replace NULL/None values on all or selected multiple DataFrame columns with either zero(0), empty string, space, or any constant literal values.

While working on PySpark DataFrame we often need to replace null values since certain operations on null value return error hence, we need to graciously handle nulls as the first step before processing. Also, while writing to a file, it’s always best practice to replace null values, not doing this result nulls on the output file.

As part of the cleanup, sometimes you may need to Drop Rows with NULL/None Values in PySpark DataFrame and Filter Rows by checking IS NULL/NOT NULL conditions.

In this article, We will use both fill() and fillna() to replace null/none values with an empty string, constant value, and zero(0) on Dataframe columns integer, string with Python examples.

Before we start, Let’s read a CSV into PySpark DataFrame file, where we have no values on certain rows of String and Integer columns, PySpark assigns null values to these no value columns.

The file we are using here is available here small_zipcode.csv

from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .master("local[1]") \
    .appName("mytechmint") \
    .getOrCreate()

filePath="resources/small_zipcode.csv"
df = spark.read.options(header='true', inferSchema='true') \
          .csv(filePath)

df.printSchema()
df.show(truncate=False)

This yields the below output. As you see columns type, city and population columns have null values.

+---+-------+--------+-------------------+-----+----------+
|id |zipcode|type    |city               |state|population|
+---+-------+--------+-------------------+-----+----------+
|1  |704    |STANDARD|null               |PR   |30100     |
|2  |704    |null    |PASEO COSTA DEL SUR|PR   |null      |
|3  |709    |null    |BDA SAN LUIS       |PR   |3700      |
|4  |76166  |UNIQUE  |CINGULAR WIRELESS  |TX   |84000     |
|5  |76177  |STANDARD|null               |TX   |null      |
+---+-------+--------+-------------------+-----+----------+

Now, let’s see how to replace these null values.

Related: PySpark - substring

PySpark fillna() & fill() Syntax

PySpark provides DataFrame.fillna() and DataFrameNaFunctions.fill() to replace NULL/None values. These two are aliases of each other and returns the same results.

fillna(value, subset=None)
fill(value, subset=None)

value – Value should be the data type of int, long, float, string, or dict. Value specified here will be replaced for NULL/None values.
subset – This is optional, when used it should be the subset of the column names where you wanted to replace NULL/None values.

PySpark Replace Null/None Values with Zero

PySpark fill(value:Long) signatures that are available in DataFrameNaFunctions is used to replace NULL/None values with numeric values either zero(0) or any constant value for all integer and long datatype columns of PySpark DataFrame or Dataset.

#Replace 0 for null for all integer columns
df.na.fill(value=0).show()

#Replace 0 for null on only population column 
df.na.fill(value=0,subset=["population"]).show()

Above both statements yields the same output, since we have just an integer column population with null values Note that it replaces only Integer columns since our value is 0.

+---+-------+--------+-------------------+-----+----------+
|id |zipcode|type    |city               |state|population|
+---+-------+--------+-------------------+-----+----------+
|1  |704    |STANDARD|null               |PR   |30100     |
|2  |704    |null    |PASEO COSTA DEL SUR|PR   |0         |
|3  |709    |null    |BDA SAN LUIS       |PR   |3700      |
|4  |76166  |UNIQUE  |CINGULAR WIRELESS  |TX   |84000     |
|5  |76177  |STANDARD|null               |TX   |0         |
+---+-------+--------+-------------------+-----+----------+

PySpark Replace Null/None Value with Empty String

Now let’s see how to replace NULL/None values with an empty string or any constant values String on all DataFrame String columns.

df.na.fill("").show(false)

Yields below output. This replaces all String type columns with empty/blank string for all NULL values.

+---+-------+--------+-------------------+-----+----------+
|id |zipcode|type    |city               |state|population|
+---+-------+--------+-------------------+-----+----------+
|1  |704    |STANDARD|                   |PR   |30100     |
|2  |704    |        |PASEO COSTA DEL SUR|PR   |null      |
|3  |709    |        |BDA SAN LUIS       |PR   |3700      |
|4  |76166  |UNIQUE  |CINGULAR WIRELESS  |TX   |84000     |
|5  |76177  |STANDARD|                   |TX   |null      |
+---+-------+--------+-------------------+-----+----------+

Now, let’s replace NULL’s on specific columns, below example replace column type with empty string and column city with value “unknown”.

df.na.fill("unknown",["city"]) \
    .na.fill("",["type"]).show()

Yields below output. This replaces null values with an empty string for type column and replaces with a constant value “unknown” for city column.

+---+-------+--------+-------------------+-----+----------+
|id |zipcode|type    |city               |state|population|
+---+-------+--------+-------------------+-----+----------+
|1  |704    |STANDARD|unknown            |PR   |30100     |
|2  |704    |        |PASEO COSTA DEL SUR|PR   |null      |
|3  |709    |        |BDA SAN LUIS       |PR   |3700      |
|4  |76166  |UNIQUE  |CINGULAR WIRELESS  |TX   |84000     |
|5  |76177  |STANDARD|unknown            |TX   |null      |
+---+-------+--------+-------------------+-----+----------+

Alternatively, you can also write the above statement as

df.na.fill({"city": "unknown", "type": ""}) \
    .show()

Complete Code

Below is complete code with Scala example. You can use it by copying it from here or use the GitHub to download the source code.

from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .master("local[1]") \
    .appName("mytechmint") \
    .getOrCreate()

filePath="resources/small_zipcode.csv"
df = spark.read.options(header='true', inferSchema='true') \
          .csv(filePath)

df.printSchema()
df.show(truncate=False)


df.fillna(value=0).show()
df.fillna(value=0,subset=["population"]).show()
df.na.fill(value=0).show()
df.na.fill(value=0,subset=["population"]).show()


df.fillna(value="").show()
df.na.fill(value="").show()

df.fillna("unknown",["city"]) \
    .fillna("",["type"]).show()

df.fillna({"city": "unknown", "type": ""}) \
    .show()

df.na.fill("unknown",["city"]) \
    .na.fill("",["type"]).show()

df.na.fill({"city": "unknown", "type": ""}) \
    .show()

Conclusion

In this PySpark article, you have learned how to replace Null/None values with zero or an empty string on integer and string columns respectively using fill() and fillna() transformation functions.

PySpark – fillna() and fill()

PySpark fillna() & fill() Syntax

PySpark Replace Null/None Values with Zero

PySpark Replace Null/None Value with Empty String

Complete Code

Conclusion

Leave a Comment Cancel reply

PySpark fillna() & fill() Syntax

PySpark Replace Null/None Values with Zero

PySpark Replace Null/None Value with Empty String

Complete Code

Conclusion

Related Posts

Leave a Comment Cancel reply