Change Timestamp Year in Spark

Spark does not provide a built-in function for replacing a single component of a timestamp, such as the year. In my case, users had entered the year incorrectly, so timestamps came out as 0021-01-31 15:00:00 when they should have been 2021-01-31 15:00:00. Here is the trick I used to change the year:

from pyspark.sql import functions as f

# Shift the year forward by 2000 and rebuild the timestamp from its parts.
df = (df
      .withColumn("dateTime1", f.concat(f.year(df.dateTime) + 2000,
                                        f.date_format(df.dateTime, "-MM-dd HH:mm:ss")).cast("timestamp")))

What it does:

  1. Format the timestamp as -MM-dd HH:mm:ss, i.e. everything except the year.
  2. Extract the year with Spark's built-in year function and add 2000 to it.
  3. Concatenate the corrected year with the formatted string.
  4. Cast the resulting string back to a timestamp.

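For completeness, here is a minimal, self-contained sketch of the same trick. It is only a sketch: it assumes a local SparkSession and uses a couple of the sample values from the table below, and the explicit cast of the year to string is just for clarity (Spark coerces it implicitly in the snippet above).

from pyspark.sql import SparkSession, functions as f

spark = SparkSession.builder.master("local[*]").getOrCreate()

# A couple of rows with the mistyped year (0021 instead of 2021).
df = spark.createDataFrame(
    [("0021-01-27 17:15:00",), ("0021-01-28 06:30:00",)],
    ["raw"],
).withColumn("dateTime", f.col("raw").cast("timestamp"))

# Shift the year forward by 2000 and rebuild the timestamp.
df = df.withColumn(
    "dateTime1",
    f.concat((f.year(df.dateTime) + 2000).cast("string"),
             f.date_format(df.dateTime, "-MM-dd HH:mm:ss")).cast("timestamp"),
)

df.show(truncate=False)
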
And a sample of the result:

dateTime              dateTime1
0021-01-27 17:15:00   2021-01-27 17:15:00
0021-01-28 06:30:00   2021-01-28 06:30:00
0021-01-30 06:33:00   2021-01-30 06:33:00
0021-01-31 07:00:00   2021-01-31 07:00:00
0021-01-31 07:00:00   2021-01-31 07:00:00
0201-01-31 18:40:00   2201-01-31 18:40:00
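
Note that the trick simply shifts the year forward by 2000, so the last row above (0201) comes out as 2201; rows that are wrong in some other way need separate handling. If you are on Spark 3.0 or later, a sketch of an alternative with the same behaviour is to add a 2000-year interval via the make_interval SQL function instead of rebuilding the string:

# Spark 3.0+ alternative: add a 2000-year interval instead of concatenating strings.
df = df.withColumn("dateTime1", f.expr("dateTime + make_interval(2000)"))

This keeps the value as a timestamp throughout and avoids the string round trip.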


To contact me, send an email anytime or leave a comment below.