Meng zhao

Reputation: 221

PySpark bug when formatting date in a column

I discovered a curious bug in the PySpark "to_date" function:

from pyspark.sql.functions import to_date, date_format, col
from datetime import date

df = spark.createDataFrame([(date(2020,12,26),)], ['t'])
df1 = df.select(to_date(df.t, 'yyyy-MM-dd').alias('dt')).withColumn('fn1', date_format(col('dt'), 'YYYYMMdd'))
df1.show()

This gets the output below:

[Screenshot of df1.show(): dt = 2020-12-26, fn1 = 20201226]

But if you use the same code for a date that's one day later,

from pyspark.sql.functions import to_date, date_format, col
from datetime import date

df = spark.createDataFrame([(date(2020,12,27),)], ['t'])
df1 = df.select(to_date(df.t, 'yyyy-MM-dd').alias('dt')).withColumn('fn1', date_format(col('dt'), 'YYYYMMdd'))
df1.show()

You'll get:

[Screenshot of df1.show(): dt = 2020-12-27, fn1 = 20211227]

The same goes for dates after 2020-12-27, but never for dates before 2020-12-25.
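
A compact way to reproduce the boundary (a sketch, assuming a running SparkSession named spark on Spark 2.x, where the week-based pattern is still accepted):

from datetime import date, timedelta
from pyspark.sql.functions import col, date_format

# One row per day around the suspected boundary: 2020-12-24 .. 2020-12-28.
rows = [(date(2020, 12, 24) + timedelta(days=i),) for i in range(5)]
df = spark.createDataFrame(rows, ['dt'])

# With the uppercase-Y pattern, the formatted year flips from 2020... to 2021...
# starting at Sunday 2020-12-27.
df.withColumn('fn1', date_format(col('dt'), 'YYYYMMdd')).show()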

Upvotes: 2

Views: 376

Answers (1)

mck

Reputation: 42352

Y means week year (the year of the week the date falls in), while y means the plain calendar year. Under the default week rules, weeks start on Sunday, so 2020-12-27 (a Sunday) falls in the first week of 2021 and Y formats the year as 2021. I think you meant to use y in your date_format pattern.
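
For example, a minimal sketch of the fix (assuming a running SparkSession named spark):

from datetime import date
from pyspark.sql.functions import col, date_format

df = spark.createDataFrame([(date(2020, 12, 27),)], ['dt'])

# Lowercase y is the plain calendar year, so the result stays 20201227.
df.withColumn('fn1', date_format(col('dt'), 'yyyyMMdd')).show()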

Note that week-based datetime patterns (such as Y and w) are no longer supported in Spark 3.
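
If you really do need the week-based behaviour on Spark 3, one option is to restore the legacy parser (a sketch; this setting exists but changes date parsing and formatting for the whole session):

spark.conf.set('spark.sql.legacy.timeParserPolicy', 'LEGACY')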

Upvotes: 1
