Meng zhao

Reputation: 221

PySpark bug when formatting date in a column

I discovered a curious bug in the PySpark "to_date" function:

from pyspark.sql.functions import to_date, date_format, col
from datetime import date

df = spark.createDataFrame([(date(2020,12,26),)], ['t'])
df1 = df.select(to_date(df.t, 'yyyy-MM-dd').alias('dt')).withColumn('fn1', date_format(col('dt'), 'YYYYMMdd'))
df1.show()

This gets the output below:

[Screenshot of df1.show(): dt = 2020-12-26, fn1 = 20201226]

But if you use the same code for a date that's one day later,

from pyspark.sql.functions import to_date, date_format, col
from datetime import date

df = spark.createDataFrame([(date(2020,12,27),)], ['t'])
df1 = df.select(to_date(df.t, 'yyyy-MM-dd').alias('dt')).withColumn('fn1', date_format(col('dt'), 'YYYYMMdd'))
df1.show()

You'll get:

[Screenshot of df1.show(): dt = 2020-12-27, fn1 = 20211227]

The same goes for dates after 2020-12-27, but never for dates before 2020-12-25.
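
A compact way to reproduce the boundary (a sketch, assuming a running SparkSession named spark on Spark 2.x, where the week-based pattern is still accepted):

from datetime import date, timedelta
from pyspark.sql.functions import col, date_format

# One row per day around the suspected boundary: 2020-12-24 .. 2020-12-28.
rows = [(date(2020, 12, 24) + timedelta(days=i),) for i in range(5)]
df = spark.createDataFrame(rows, ['dt'])

# With the uppercase-Y pattern, the formatted year flips from 2020... to 2021...
# starting at Sunday 2020-12-27.
df.withColumn('fn1', date_format(col('dt'), 'YYYYMMdd')).show()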

Upvotes: 2

Views: 376

Answers (1)

mck

Reputation: 42352

Y means week year (the year of the week the date falls in), while y means the plain calendar year. Under the default week rules, weeks start on Sunday, so 2020-12-27 (a Sunday) falls in the first week of 2021 and Y formats the year as 2021. I think you meant to use y in your date_format pattern.
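
For example, a minimal sketch of the fix (assuming a running SparkSession named spark):

from datetime import date
from pyspark.sql.functions import col, date_format

df = spark.createDataFrame([(date(2020, 12, 27),)], ['dt'])

# Lowercase y is the plain calendar year, so the result stays 20201227.
df.withColumn('fn1', date_format(col('dt'), 'yyyyMMdd')).show()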

Note that week-based datetime patterns (such as Y and w) are no longer supported in Spark 3.
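
If you really do need the week-based behaviour on Spark 3, one option is to restore the legacy parser (a sketch; this setting exists but changes date parsing and formatting for the whole session):

spark.conf.set('spark.sql.legacy.timeParserPolicy', 'LEGACY')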

Upvotes: 1
