lakeuk

Reputation: 201

String to Date migration from Spark 2.0 to 3.0 gives Fail to recognize 'EEE MMM dd HH:mm:ss zzz yyyy' pattern in the DateTimeFormatter

I have a date string from a source in the format 'Fri May 24 00:00:00 BST 2019' that I convert to a date and store in my dataframe as '2019-05-24', using code like the example below, which works for me under Spark 2.0:

from pyspark.sql.functions import to_date, unix_timestamp, from_unixtime

# parse with a SimpleDateFormat-style pattern, then truncate to a date
df = spark.createDataFrame([("Fri May 24 00:00:00 BST 2019",)], ['date_str'])
df2 = df.select('date_str', to_date(from_unixtime(unix_timestamp('date_str', 'EEE MMM dd HH:mm:ss zzz yyyy'))).alias('date'))
df2.show(1, False)
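Under Spark 2.x this gives the following (output reconstructed here for reference; the date assumes a session timezone, such as Europe/London, where midnight BST stays on the 24th):

+----------------------------+----------+
|date_str                    |date      |
+----------------------------+----------+
|Fri May 24 00:00:00 BST 2019|2019-05-24|
+----------------------------+----------+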

In my sandbox environment I've updated to Spark 3.0 and now get the following error for the above code. Is there a new method of doing this in 3.0 to convert my string to a date?

: org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: Fail to recognize 'EEE MMM dd HH:mm:ss zzz yyyy' pattern in the DateTimeFormatter.

  1. You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0.
  2. You can form a valid datetime pattern with the guide from https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html

Upvotes: 15

Views: 78848

Answers (4)

Manjunath M

Reputation: 11

As the Spark migration guide explains, parsing/formatting of timestamp/date strings changed in 3.0. This affects the CSV/JSON data sources and the unix_timestamp, date_format, to_unix_timestamp, from_unixtime, to_date and to_timestamp functions whenever user-specified patterns are used for parsing and formatting. In Spark 3.0, pattern strings are defined in Datetime Patterns for Formatting and Parsing, implemented via DateTimeFormatter under the hood, and the new implementation performs strict checking of its input. For example, the timestamp 2015-07-22 10:00:00 cannot be parsed with the pattern yyyy-MM-dd because the parser does not consume the whole input. Another example: the input 31/01/2015 00:00 cannot be parsed with the pattern dd/MM/yyyy hh:mm because hh expects hours in the range 1-12. In Spark 2.4 and below, java.text.SimpleDateFormat was used for timestamp/date string conversions, and the supported patterns are described in SimpleDateFormat. The old behavior can be restored by setting spark.sql.legacy.timeParserPolicy to LEGACY.

So, adding this Spark config to your code will fix the issue:

spark.conf.set("spark.sql.legacy.timeParserPolicy","LEGACY")
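To make the strict checking concrete, here is a minimal sketch (not part of the original answer) using the migration guide's first example; it assumes the default EXCEPTION policy, under which Spark raises rather than silently changing results:

from pyspark.sql.functions import to_date, col

df = spark.createDataFrame([("2015-07-22 10:00:00",)], ["ts"])

# Default policy: the new parser fails because 'yyyy-MM-dd' does not
# consume the whole input, so Spark raises SparkUpgradeException.
spark.sql("set spark.sql.legacy.timeParserPolicy=EXCEPTION")
# df.select(to_date(col("ts"), "yyyy-MM-dd")).show()  # raises

# LEGACY policy: lenient SimpleDateFormat parsing succeeds.
spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY")
df.select(to_date(col("ts"), "yyyy-MM-dd")).show()  # |2015-07-22|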

Upvotes: 1

Chris Huang-Leaver

Reputation: 6089

The difference between the legacy parser and the current version of Spark is subtle.

For example:

spark.sql("set spark.sql.legacy.timeParserPolicy=EXCEPTION")

df = spark.createDataFrame([('12/25/2019 01:30:00 PM',),], ['Christmas'])
df.select(to_timestamp(col('Christmas'),'MM/dd/yyyy hh:mm:ss a')).show()

Outputs the following:

+----------------------------------------------+
|to_timestamp(Christmas, MM/dd/yyyy hh:mm:ss a)|
+----------------------------------------------+
|                           2019-12-25 13:30:00|
+----------------------------------------------+

However:

spark.sql("set spark.sql.legacy.timeParserPolicy=EXCEPTION")

df = spark.createDataFrame([('12/25/2019 01:30:00 PM',)], ['Christmas'])
df.select(to_timestamp(col('Christmas'), 'MM/dd/yyyy hh:mm:ss aa')).show()

This will raise a SparkUpgradeException.

Notice we have 'aa' in the time format, not just a single 'a'.

According to the Java docs for DateTimeFormatter, which is what the to_timestamp function uses under the hood in Spark 3, 'aa' was always wrong; I guess earlier versions of Spark, which used SimpleDateFormat, were simply more lenient.

So either fix the date formats (one possible fix for the question's pattern is sketched below) or set the timeParserPolicy to 'LEGACY' as Shivam suggested.
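For the question's format specifically, fixing the pattern is possible because the Spark 3.0 pattern guide notes that 'E' (day-of-week) can only be used for formatting, not parsing. A sketch (not from the original answer) that drops the day-of-week prefix and parses the remainder with a Spark 3.0-legal pattern:

from pyspark.sql.functions import to_date, to_timestamp

df = spark.createDataFrame([('Fri May 24 00:00:00 BST 2019',)], ['date_str'])

# Skip the 'Fri ' prefix (4 characters) and parse the rest;
# 'zzz' still handles the zone name under the new parser.
df.select(
    to_date(to_timestamp(df.date_str.substr(5, 24), 'MMM dd HH:mm:ss zzz yyyy')).alias('date')
).show()

Note lakeuk's follow-up below, though: under CORRECTED the 'BST' zone name may resolve to an unexpected offset, so verify the results for your zone.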

Upvotes: 1

lakeuk

Reputation: 201

Thanks for the responses, excellent advice. For the moment I'll be going with the LEGACY setting. I have a workaround for Spark 3.0 by substringing out the EEE element, but I've noticed what looks like a bug in how the BST timezone converts: it is incorrectly offset by 10 hours, while under LEGACY it correctly remains the same (I'm currently in the BST zone). I can do something with this but will wait until the clocks change in the autumn to confirm.

spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY")
df = spark.createDataFrame([('Fri May 24 00:00:00 BST 2019',)], ['mydate'])
df = df.select('mydate',
                to_timestamp(df.mydate.substr(5, 28), 'MMM dd HH:mm:ss zzz yyyy').alias('datetime'),
                to_timestamp(df.mydate, 'EEE MMM dd HH:mm:ss zzz yyyy').alias('LEGACYdatetime')
               ).show(1, False)

df = spark.createDataFrame([('Fri May 24 00:00:00 GMT 2019',)], ['mydate'])
df = df.select('mydate', 
                to_timestamp(df.mydate.substr(5, 28), 'MMM dd HH:mm:ss zzz yyyy').alias('datetime'),
                to_timestamp(df.mydate, 'EEE MMM dd HH:mm:ss zzz yyyy').alias('LEGACYdatetime')
               ).show(1, False)

spark.sql("set spark.sql.legacy.timeParserPolicy=CORRECTED")
df = spark.createDataFrame([('Fri May 24 00:00:00 BST 2019',)], ['mydate'])
df = df.select('mydate', 
                to_timestamp(df.mydate.substr(5, 28), 'MMM dd HH:mm:ss zzz yyyy').alias('datetime')          
               ).show(1, False)

df = spark.createDataFrame([('Fri May 24 00:00:00 GMT 2019',)], ['mydate'])
df = df.select('mydate', 
                to_timestamp(df.mydate.substr(5, 28), 'MMM dd HH:mm:ss zzz yyyy').alias('datetime')           
               ).show(1, False)
+----------------------------+-------------------+-------------------+
|mydate                      |datetime           |LEGACYdatetime     |
+----------------------------+-------------------+-------------------+
|Fri May 24 00:00:00 BST 2019|2019-05-24 00:00:00|2019-05-24 00:00:00|
+----------------------------+-------------------+-------------------+

+----------------------------+-------------------+-------------------+
|mydate                      |datetime           |LEGACYdatetime     |
+----------------------------+-------------------+-------------------+
|Fri May 24 00:00:00 GMT 2019|2019-05-24 01:00:00|2019-05-24 01:00:00|
+----------------------------+-------------------+-------------------+

+----------------------------+-------------------+
|mydate                      |datetime           |
+----------------------------+-------------------+
|Fri May 24 00:00:00 BST 2019|2019-05-23 14:00:00|
+----------------------------+-------------------+

+----------------------------+-------------------+
|mydate                      |datetime           |
+----------------------------+-------------------+
|Fri May 24 00:00:00 GMT 2019|2019-05-24 01:00:00|
+----------------------------+-------------------+

Upvotes: 5

Shivam Tripathi

Reputation: 401

If you want to use the legacy format in a newer version of Spark (3.0+), set spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY") or spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY"), which will resolve the issue.
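Applied to the question's snippet, a minimal sketch (only the config line is new relative to the original code):

from pyspark.sql.functions import to_date, unix_timestamp, from_unixtime

# Restore the Spark 2.x SimpleDateFormat-based parser
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")

df = spark.createDataFrame([("Fri May 24 00:00:00 BST 2019",)], ['date_str'])
df.select(to_date(from_unixtime(unix_timestamp('date_str', 'EEE MMM dd HH:mm:ss zzz yyyy'))).alias('date')).show()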

Upvotes: 30
