Reputation: 221
I discovered a curious bug in the PySpark "to_date" function:
from pyspark.sql.functions import to_date, date_format, col
from datetime import date

df = spark.createDataFrame([(date(2020, 12, 26),)], ['t'])
df1 = df.select(to_date(df.t, 'yyyy-MM-dd').alias('dt')) \
        .withColumn('fn1', date_format(col('dt'), 'YYYYMMdd'))
df1.show()
This gives the expected output, with fn1 = 20201226.
But if you use the same code for a date that's one day later,
from pyspark.sql.functions import to_date, date_format, col
from datetime import date

df = spark.createDataFrame([(date(2020, 12, 27),)], ['t'])
df1 = df.select(to_date(df.t, 'yyyy-MM-dd').alias('dt')) \
        .withColumn('fn1', date_format(col('dt'), 'YYYYMMdd'))
df1.show()
You'll get fn1 = 20211227, i.e. the year part jumps to 2021.
The same goes for dates after 2020-12-27, but never for dates before 2020-12-25.
Upvotes: 2
Views: 376
Reputation: 42352
Y means week-based year, while y means the ordinary calendar year. In the week numbering the formatter uses, 2020-12-27 falls into the first week of 2021, so you will get 2021 whenever you format it with Y. I think you meant to use y in your date_format pattern.

Note that week-based datetime patterns (such as Y and w) are deprecated in Spark 3.
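A minimal sketch of the corrected call, assuming the same spark session and input frame as in the question:

from pyspark.sql.functions import to_date, date_format, col
from datetime import date

df = spark.createDataFrame([(date(2020, 12, 27),)], ['t'])

# 'y' is the ordinary calendar year, so fn1 comes out as 20201227 rather than 20211227
df1 = df.select(to_date(df.t, 'yyyy-MM-dd').alias('dt')) \
        .withColumn('fn1', date_format(col('dt'), 'yyyyMMdd'))
df1.show()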
Upvotes: 1