In this tutorial, we will learn about The Most Useful Date Manipulation Functions in Spark in Details.

DateTime functions will always be tricky but very important irrespective of language or framework. In this blog post, we review the DateTime functions available in Apache Spark.

Pyspark and Spark SQL provide many built-in functions. The functions such as the date and time functions are useful when you are working with DataFrame which stores date and time type values. Sometimes you may be tempted to write a UDF(user-defined functions) only later to realize that it would have been better to check out the documentation because it already exists. In this article, we will check what are the most important Spark SQL date functions with some examples.

Spark SQL Date Functions

n date functions are user and performance-friendly. Spark SQL supports almost all date functions that are supported in Apache Hive. Following in the table below are the Spark SQL date functions these can be used to manipulate the data frame columns that contain data type values. The list contains pretty much all date functions that are supported in Apache Spark.

Setup

Here are a few key setup steps before trying the examples further down.

Import Functions in PySpark

from pyspark.sql.functions import *

Create Sample DataFrame

Let’s try to create a sample DataFrame so that we can use it for the rest of this blog to understand the various DateTime functions.

emp = [(1, "AAA", "dept1", 1000, "2019-02-01 15:12:13"),
(2, "BBB", "dept1", 1100, "2018-04-01 5:12:3"),
(3, "CCC", "dept1", 3000, "2017-06-05 1:2:13"),
(4, "DDD", "dept1", 1500, "2019-08-10 10:52:53"),
(5, "EEE", "dept2", 8000, "2016-01-11 5:52:43"),
(6, "FFF", "dept2", 7200, "2015-04-14 19:32:33"),
(7, "GGG", "dept3", 7100, "2019-02-21 15:42:43"),
(8, "HHH", "dept3", 3700, "2016-09-25 15:32:33"),
(9, "III", "dept3", 4500, "2017-10-15 15:22:23"),
(10, "JJJ", "dept5", 3400, "2018-12-17 15:14:17")]
empdf = spark.createDataFrame(emp, ["id", "name", "dept", "salary", "date"])

add_months

This function adds months to a date. It will return a new date, however many months from the start date. In the below statement we add 1 month to the column “date” and generated a new column as “next_month”.

df = (empdf
    .select("date")
    .withColumn("next_month", add_months("date", 1)))
df.show(2)

Output

+-------------------+--------------+
|         date      |    next_month|
+-------------------+--------------+
|2019-08-10 10:52:53|    2019-09-10|
|  2018-04-01 5:12:3|    2018-05-01|
+-------------------+--------------+

current_date

This function returns the current date.

df = (empdf
    .withColumn("current_date", current_date())
    .select("id", "current_date"))
df.show(2)

Output

+---+------------+
| id|current_date|
+---+------------+
|  5|  2019-10-07|
| 10|  2019-10-07|
+---+------------+

current_timestamp

This function returns the current timestamp.

df = (empdf
    .withColumn("current_timestamp", current_timestamp())
    .select("id", "current_timestamp"))
df.show(2,False)

Output

+---+-----------------------+
|id |current_timestamp      |
+---+-----------------------+
|1  |2019-10-07 14:40:42.471|
|2  |2019-10-07 14:40:42.471|
+---+-----------------------+

date_add

in a new column as “next_date”. E.g. for date: 1st Feb 2019 it returns 6th Feb 2019.

df = (empdf
    .select("date")
    .withColumn("next_date", date_add("date", 5)))
df.show(2)

Output

+-------------------+----------+
|               date| next_date|
+-------------------+----------+
|2019-02-01 15:12:13|2019-02-06|
|  2018-04-01 5:12:3|2018-04-06|
+-------------------+----------+

date_format

This function will convert the date to the specified format. For example, we can convert the date from “yyyy-MM-dd” to “dd/MM/yyyy” format.

df = (empdf
    .select("date")
    .withColumn("new_date", date_format("date", "dd/MM/yyyy")))
df.show(2)

Output

+-------------------+----------+
|               date|  new_date|
+-------------------+----------+
|2019-02-01 15:12:13|01/02/2019|
|  2018-04-01 5:12:3|01/04/2018|
+-------------------+----------+

date_sub

date_add. In the example below, it returns a date that is 5 days earlier in a column as “new_date”. For example, date 1st Feb 2019 returns 27th Jan 2019.

df = (empdf
    .select("date")
    .withColumn("new_date", date_sub("date", 5)))
df.show(2)

Output

+-------------------+----------+
|               date|  new_date|
+-------------------+----------+
|2019-02-01 15:12:13|2019-01-27|
|  2018-04-01 5:12:3|2018-03-27|
+-------------------+----------+

date_trunc

This function returns a timestamp truncated to the specified unit. It could be a year, month, day, hour, minute, second, week or quarter.

Let’s truncate the date by a year. we can use “yyyy” or “yy” or” “year” to specify year. For timestamp “2019–02–01 15:12:13”, if we truncate based on the year it will return “2019–01–01 00:00:00”

df = (empdf
    .select("date")
    .withColumn("new_date", date_trunc("year", "date")))
df.show(2)

Output

+-------------------+-------------------+
|               date|           new_date|
+-------------------+-------------------+
|2019-02-01 15:12:13|2019-01-01 00:00:00|
|  2018-04-01 5:12:3|2018-01-01 00:00:00|
+-------------------+-------------------+

Let’s truncate date by month. We use “mm” or “month” or” “mon” to specify month. For timestamp “2019–02–01 15:12:13”, if we truncate based on month it returns “2019–02–01 00:00:00”

df = (empdf
    .select("date")
    .withColumn("new_date", date_trunc("month", "date")))
df.show(2)

Output

+-------------------+-------------------+
|               date|           new_date|
+-------------------+-------------------+
|2019-02-01 15:12:13|2019-02-01 00:00:00|
|  2018-04-01 5:12:3|2018-04-01 00:00:00|
+-------------------+-------------------+

Let’s truncate date by day. We can use “day” or “dd” to specify day. For timestamp “2019–02–10 15:12:13”, if we truncate based on day it will return “2019–02–10 00:00:00”

df = (empdf
    .select("date")
    .withColumn("new_date", date_trunc("day", "date")))
df.show(2)

Output

+-------------------+-------------------+
|               date|           new_date|
+-------------------+-------------------+
|2019-02-01 15:12:13|2019-02-01 00:00:00|
|  2018-04-01 5:12:3|2018-04-01 00:00:00|
+-------------------+-------------------+

we can use these many formats to truncate the date based on different levels. we can use these formats: ‘year’, ‘yyyy’, ‘yy’, ‘month’, ‘mon’, ‘mm’, ‘day’, ‘dd’, ‘hour’, ‘minute’, ‘second’, ‘week’, ‘quarter’.

Related: PySpark - when

datediff

This function returns the difference between dates in terms of days. Let’s add another column as the current date and then take the difference between “current_date” and “date”.

Output

+-------------------+------------+---------+
|               date|current_date|date_diff|
+-------------------+------------+---------+
|2019-02-01 15:12:13|  2019-10-07|      248|
|  2018-04-01 5:12:3|  2019-10-07|      554|
+-------------------+------------+---------+

dayofmonth

This function returns the day of the month. For 5th Jan 2019 (2019–01–05) it will return 5.

df = (empdf
    .select("date")
    .withColumn("dayofmonth", dayofmonth("date")))
df.show(2)

Output

+-------------------+----------+
|               date|dayofmonth|
+-------------------+----------+
|2019-02-01 15:12:13|         1|
|  2018-04-01 5:12:3|         1|
+-------------------+----------+

dayofweek

This function returns the day of the week as an integer. It will consider Sunday as 1st and Saturday as 7th. For 1st Feb 2019 (2019–02–01) which is Friday, it will return 6. Similarly, for 1st April 2018 (2018–04–01) which is Sunday, it will return 1.

df = (empdf
    .select("date")
    .withColumn("dayofweek", dayofweek("date")))
df.show(2)

Output

+-------------------+---------+
|               date|dayofweek|
+-------------------+---------+
|2019-02-01 15:12:13|        6|
|  2018-04-01 5:12:3|        1|
+-------------------+---------+

dayofyear

This function returns the day of the year as an integer. For 1st Feb it will return 32 (31 days of Jan +1 day of Feb). For 1st April 2018, it will return 91 (31 days of Jan + 28 days of Feb (2018 is a non-leap year) + 31 days of Mar + 1 day of April).

df = (empdf
    .select("date")
    .withColumn("dayofyear", dayofyear("date")))
df.show(2)

Output

+-------------------+---------+
|               date|dayofyear|
+-------------------+---------+
|2019-02-01 15:12:13|       32|
|  2018-04-01 5:12:3|       91|
+-------------------+---------+

from_utc_timestamp

This function converts UTC timestamps to timestamps of any specified timezone. By default, it assumes the date is a UTC timestamp.
Let’s convert a UTC timestamp to “PST” time.

df = (empdf
    .select("date")
    .withColumn("pst_timestamp", from_utc_timestamp("date", "PST")))
df.show(2)

Output

+-------------------+-------------------+
|               date|      pst_timestamp|
+-------------------+-------------------+
|2019-02-01 15:12:13|2019-02-01 07:12:13|
|  2018-04-01 5:12:3|2018-03-31 22:12:03|
+-------------------+-------------------+

unix_timestamp

spark.sql.session.timeZone” to set the timezone.)

df = (empdf
    .select("date")
    .withColumn("unix_timestamp", unix_timestamp("date", "yyyy-MM-dd HH:mm:ss")))
df.show(2)

Output

+-------------------+--------------+
|               date|unix_timestamp|
+-------------------+--------------+
|2019-02-01 15:12:13|    1549033933|
|  2018-04-01 5:12:3|    1522559523|
+-------------------+--------------+

from_unixtime

spark.sql.session.timeZone” to set the timezone). For demonstration purposes, we have converted the timestamp to Unix timestamp and converted it back to timestamp.

df = (empdf
    .select("date")
    # Convert timestamp to unix timestamp.
    .withColumn("unix_timestamp", unix_timestamp("date", "yyyy-MM-dd HH:mm:ss"))
    # Convert unix timestamp to timestamp.
    .withColumn("date_from_unixtime", from_unixtime("unix_timestamp")))
df.show(2)

Output

+-------------------+--------------+-------------------+
|               date|unix_timestamp| date_from_unixtime|
+-------------------+--------------+-------------------+
|2019-02-01 15:12:13|    1549033933|2019-02-01 15:12:13|
|  2018-04-01 5:12:3|    1522559523|2018-04-01 05:12:03|
+-------------------+--------------+-------------------+

hour

This function will return the hour part of the date.

df = (empdf
    .select("date")
    .withColumn("hour", hour("date")))
df.show(2)

Output

+-------------------+----+
|               date|hour|
+-------------------+----+
|2019-02-01 15:12:13|  15|
|  2018-04-01 5:12:3|   5|
+-------------------+----+

last_day

This function will return the last date of the month for a given date. For 5th Jan 2019, it will return 31st Jan 2019, since this is the last date for the month.

df = empdf.select("date").withColumn("last_date", last_day("date"))

Output

+-------------------+----------+
|               date| last_date|
+-------------------+----------+
|2019-02-01 15:12:13|2019-02-28|
|  2018-04-01 5:12:3|2018-04-30|
+-------------------+----------+

minute

This function will return minute part of the date.

df = (empdf
    .select("date")
    .withColumn("minute", minute("date")))
df.show(2)

Output

+-------------------+------+
|               date|minute|
+-------------------+------+
|2019-02-01 15:12:13|    12|
|  2018-04-01 5:12:3|    12|
+-------------------+------+

month

This function will return the month part of the date.

df = (empdf
    .select("date")
    .withColumn("month", month("date")))
df.show(2)

Output

+-------------------+-----+
|               date|month|
+-------------------+-----+
|2019-02-01 15:12:13|    2|
|  2018-04-01 5:12:3|    4|
+-------------------+-----+

months_between

This function returns the difference between dates in terms of months. If the first date is greater than the second one, the result will be positive else negative. For example, between 6th Feb 2019 and 5th Jan 2019, it will return 1.

df = (empdf
    .select("date")
    # Add another date column as current date.        
    .withColumn("current_date", current_date()) 
    # Take the difference between current_date and date column in terms of months.
    .withColumn("months_between", months_between("current_date", "date"))) 
df.show(2)

Note: from Spark 2.4.0 onwards you can specify the third argument “roundOff=True” to round-Off the value. The default value is True.

Related: Running Excel VBA from Python

Output

+-------------------+------------+--------------+
|               date|current_date|months_between|
+-------------------+------------+--------------+
|2019-02-01 15:12:13|  2019-10-09|    8.23762955|
|  2018-04-01 5:12:3|  2019-10-09|   18.25107415|
+-------------------+------------+--------------+

next_day

This function will return the next day based on the dayOfWeek specified in the next argument. For e.g. for 1st Feb 2019 (Friday) if we ask for next_day as Sunday, it will return 3rd Feb 2019.

df = (empdf
    .select("date")
    .withColumn("next_day", next_day("date", "sun")))
df.show(2)

Output

+-------------------+----------+
|               date|  next_day|
+-------------------+----------+
|2019-02-01 15:12:13|2019-02-03|
|  2018-04-01 5:12:3|2018-04-08|
+-------------------+----------+

quarter

This function will return a quarter of the given date as an integer.

df = (empdf
    .select("date")
    .withColumn("quarter", quarter("date")))
df.show(2)

Output

+-------------------+-------+
|               date|quarter|
+-------------------+-------+
|2019-02-01 15:12:13|      1|
|  2018-04-01 5:12:3|      2|
+-------------------+-------+

second

This function will return the second part of the date.

df = (empdf
    .select("date")
    .withColumn("second", second("date")))
df.show(2)

Output

+-------------------+------+
|               date|second|
+-------------------+------+
|2019-02-01 15:12:13|    13|
|  2018-04-01 5:12:3|     3|
+-------------------+------+

to_date

This function will convert the String or TimeStamp to Date.

df = (empdf
    .select("date")
    .withColumn("to_date", to_date("date")))
df.show(2)

Note: Check the data type of column “date” and “to-date”.

If the string format is ‘yyyy-MM-dd HH:mm:ss’ then we need not specify the format. Otherwise, specify the format as the second arg in to_date function.

Output

+-------------------+----------+
|               date|   to_date|
+-------------------+----------+
|2019-02-01 15:12:13|2019-02-01|
|  2018-04-01 5:12:3|2018-04-01|
+-------------------+----------+

Here we convert a string of format ‘dd/MM/yyyy HH:mm:ss’ to “date” data type. Note the default format is ‘yyyy-MM-dd`.

df1 = spark.createDataFrame([('15/02/2019 10:30:00',)], ['date'])
df2 = (df1
    .withColumn("new_date", to_date("date", 'dd/MM/yyyy HH:mm:ss')))    
df2.show(2)

Output

+-------------------+----------+
|               date|  new_date|
+-------------------+----------+
|15/02/2019 10:30:00|2019-02-15|
+-------------------+----------+

to_timestamp

This function converts String to TimeStamp. Here we convert a string of format ‘dd/MM/yyyy HH:mm:ss’ to the “timestamp” data type. The default format is ‘yyyy-MM-dd HH:mm:ss’

df1 = spark.createDataFrame([('15/02/2019 10:30:00',)], ['date'])
df2 = (df1
    .withColumn("new_date", to_timestamp("date", 'dd/MM/yyyy HH:mm:ss')))
df2.show(2)

Output

+-------------------+-------------------+
|               date|           new_date|
+-------------------+-------------------+
|15/02/2019 10:30:00|2019-02-15 10:30:00|
+-------------------+-------------------+

to_utc_timestamp

This function converts given timestamp to UTC timestamp. Let’s convert a “PST” timestamp to “UTC” timestamp.

df = (empdf
    .select("date")
    .withColumn("utc_timestamp", to_utc_timestamp("date", "PST")))
df.show(2)

Output

+-------------------+-------------------+
|               date|      utc_timestamp|
+-------------------+-------------------+
|2019-02-01 15:12:13|2019-02-01 23:12:13|
|  2018-04-01 5:12:3|2018-04-01 12:12:03|
+-------------------+-------------------+

weekofyear

This function will return the weekofyear for the given date.

df = (empdf
    .select("date")
    .withColumn("weekofyear", weekofyear("date")))
df.show(2)

Output

+-------------------+----------+
|               date|weekofyear|
+-------------------+----------+
|2019-02-01 15:12:13|         5|
|  2018-04-01 5:12:3|        13|
+-------------------+----------+

year

This function will return the year part of the date.

df = (empdf
    .select("date")
    .withColumn("year", year("date")))
df.show(2)

Output

+-------------------+----+
|               date|year|
+-------------------+----+
|2019-02-01 15:12:13|2019|
|  2018-04-01 5:12:3|2018|
+-------------------+----+

Spark SQL Date Functions

Setup

Import Functions in PySpark

Create Sample DataFrame

add_months

Output

current_date

Output

current_timestamp

Output

date_add

Output

date_format

Output

date_sub

Output

date_trunc

Output

Output

Output

datediff

Output

dayofmonth

Output

dayofweek

Output

dayofyear

Output

from_utc_timestamp

Output

unix_timestamp

Output

from_unixtime

Output

hour

Output

last_day

Output

minute

Output

month

Output

months_between

Output

next_day

Output

quarter

Output

second

Output

to_date

Output

Output

to_timestamp

Output

to_utc_timestamp

Output

weekofyear

Output

year

Output

Related Posts

Leave a Comment Cancel reply