
PySpark – datediff() and months_between()

Using the PySpark SQL functions datediff() and months_between(), you can calculate the difference between two dates in days, months, and years. Let's see this with a DataFrame example. You can also use these functions to calculate age.

datediff() Function

First, let's see how to get the difference between two dates in days using the datediff() function.

from pyspark.sql.functions import *
data = [("1","2019-07-01"),("2","2019-06-24"),("3","2019-08-24")]
df=spark.createDataFrame(data=data,schema=["id","date"])

df.select(
      col("date"),
      current_date().alias("current_date"),
      datediff(current_date(),col("date")).alias("datediff")
    ).show()
#Output
+----------+------------+--------+
|      date|current_date|datediff|
+----------+------------+--------+
|2019-07-01|  2021-02-26|     606|
|2019-06-24|  2021-02-26|     613|
|2019-08-24|  2021-02-26|     552|
+----------+------------+--------+
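Under the hood, datediff() simply counts whole calendar days between the two dates, the same result you get by subtracting two datetime.date objects in plain Python. As a sanity check (a sketch assuming the current_date of 2021-02-26 shown in the output above), the numbers can be reproduced without Spark:

```python
from datetime import date

# Assumed "current date" matching the example output above (2021-02-26)
current = date(2021, 2, 26)

# Day counts match the datediff column in the output
print((current - date(2019, 7, 1)).days)   # 606
print((current - date(2019, 6, 24)).days)  # 613
print((current - date(2019, 8, 24)).days)  # 552
```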

months_between() Function

Now, let's see how to get the month and year differences between two dates using the months_between() function.

from pyspark.sql.functions import *
df.withColumn("datesDiff", datediff(current_date(),col("date"))) \
  .withColumn("monthsDiff", months_between(current_date(),col("date"))) \
  .withColumn("monthsDiff_round",round(months_between(current_date(),col("date")),2)) \
  .withColumn("yearsDiff",months_between(current_date(),col("date"))/lit(12)) \
  .withColumn("yearsDiff_round",round(months_between(current_date(),col("date"))/lit(12),2)) \
  .show()

This yields the output below. Note that round() and lit() are applied on top of months_between() to get the difference in years between the two dates.

#Output
+---+----------+---------+-----------+----------------+------------------+---------------+
| id|      date|datesDiff| monthsDiff|monthsDiff_round|         yearsDiff|yearsDiff_round|
+---+----------+---------+-----------+----------------+------------------+---------------+
|  1|2019-07-01|      606|19.80645161|           19.81|1.6505376341666667|           1.65|
|  2|2019-06-24|      613|20.06451613|           20.06|1.6720430108333335|           1.67|
|  3|2019-08-24|      552|18.06451613|           18.06|1.5053763441666668|           1.51|
+---+----------+---------+-----------+----------------+------------------+---------------+
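To see where the fractional months come from: Spark's months_between() returns the number of whole calendar months between the dates plus the remaining day-of-month difference divided by 31. A rough pure-Python re-derivation for the first row (2021-02-26 minus 2019-07-01, assuming the same current date as the output above) reproduces the value:

```python
# Whole calendar months from 2019-07 to 2021-02
whole_months = (2021 - 2019) * 12 + (2 - 7)   # 19

# Remaining day-of-month difference, divided by 31 per Spark's convention
fraction = (26 - 1) / 31

print(round(whole_months + fraction, 8))         # 19.80645161 (monthsDiff)
print(round((whole_months + fraction) / 12, 2))  # 1.65 (yearsDiff_round)
```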

Let's see another example where the dates are not in the default PySpark DateType format yyyy-MM-dd. When dates are not in DateType format, all date functions return null. Hence, you first need to convert the input date to Spark DateType using the to_date() function.

from pyspark.sql.functions import *
data2 = [("1","07-01-2019"),("2","06-24-2019"),("3","08-24-2019")]
df2 = spark.createDataFrame(data=data2, schema=["id","date"])
df2.select(
    to_date(col("date"),"MM-dd-yyyy").alias("date"),
    current_date().alias("endDate"),
    datediff(current_date(), to_date(col("date"),"MM-dd-yyyy")).alias("datediff")
  ).show()
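If the Spark datetime pattern looks unfamiliar, note that "MM-dd-yyyy" corresponds to "%m-%d-%Y" in Python's own strptime. As a quick cross-check (plain Python, not part of the Spark example), the same input strings parse to the dates you would expect:

```python
from datetime import datetime

# Spark pattern "MM-dd-yyyy" maps to strptime pattern "%m-%d-%Y"
parsed = [datetime.strptime(s, "%m-%d-%Y").date()
          for s in ["07-01-2019", "06-24-2019", "08-24-2019"]]
print(parsed)
# [datetime.date(2019, 7, 1), datetime.date(2019, 6, 24), datetime.date(2019, 8, 24)]
```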

SQL Example

Let's see how to calculate the difference between two dates in years using a PySpark SQL example. Similarly, you can calculate the difference in days and months.

spark.sql("select round(months_between(current_date(),'2019-07-01')/12,2) as years_diff").show()

Complete Example

from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder \
          .appName('mytechmint') \
          .getOrCreate()
data = [("1","2019-07-01"),("2","2019-06-24"),("3","2019-08-24")]

df=spark.createDataFrame(data=data,schema=["id","date"])

from pyspark.sql.functions import *

df.select(
      col("date"),
      current_date().alias("current_date"),
      datediff(current_date(),col("date")).alias("datediff")
    ).show()

df.withColumn("datesDiff", datediff(current_date(),col("date"))) \
      .withColumn("monthsDiff", months_between(current_date(),col("date"))) \
      .withColumn("monthsDiff_round",round(months_between(current_date(),col("date")),2)) \
      .withColumn("yearsDiff",months_between(current_date(),col("date"))/lit(12)) \
      .withColumn("yearsDiff_round",round(months_between(current_date(),col("date"))/lit(12),2)) \
      .show()

data2 = [("1","07-01-2019"),("2","06-24-2019"),("3","08-24-2019")]  
df2=spark.createDataFrame(data=data2,schema=["id","date"])
df2.select(
    to_date(col("date"),"MM-dd-yyyy").alias("date"),
    current_date().alias("endDate"),
    datediff(current_date(), to_date(col("date"),"MM-dd-yyyy")).alias("datediff")
  ).show()

spark.sql("select round(months_between(current_date(),'2019-07-01')/12,2) as years_diff").show()

Conclusion

In this tutorial, you learned how to calculate the number of days, months, and years between two dates using the PySpark date and time functions datediff() and months_between().
