BackInBlack

Reputation: 109

PySpark datetime patterns with day-of-week

I'm having trouble parsing datetime strings that contain a day-of-week with the to_timestamp function:

According to the Spark documentation, the string Fri, 23 Aug 2024 12:11:16 GMT should be parsed with the pattern EEE, dd MMM yyyy HH:mm:ss 'GMT'. However, I am getting a SparkUpgradeException:

SparkUpgradeException: [INCONSISTENT_BEHAVIOR_CROSS_VERSION.DATETIME_PATTERN_RECOGNITION] You may get a different result due to the upgrading to Spark >= 3.0:
Fail to recognize 'EEE, dd MMM yyyy HH:mm:ss \'GMT\'' pattern in the DateTimeFormatter.
1) You can set "spark.sql.legacy.timeParserPolicy" to "LEGACY" to restore the behavior before Spark 3.0.
2) You can form a valid datetime pattern with the guide from 'https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html'.

The problem seems to be the day-of-week part of the string, because the pattern dd MMM yyyy HH:mm:ss 'GMT' works fine for 23 Aug 2024 12:11:16 GMT.

Is there a way to parse this datetime string without setting the LEGACY flag?

Spark version - 3.5.0, runtime - Databricks 14.3

Upvotes: 3

Views: 91

Answers (2)

Viktor Chekhovoi

Reputation: 82

The problem arises because the 'E' symbol cannot be used for timestamp parsing:

Symbols of ‘E’, ‘F’, ‘q’ and ‘Q’ can only be used for datetime formatting, e.g. date_format. They are not allowed used for datetime parsing, e.g. to_timestamp.

Since the day of the week doesn't matter for the timestamp, you should instead remove it with a regex:

from pyspark.sql.functions import regexp_replace, to_timestamp

# Strip the leading three-letter day-of-week prefix, e.g. "Fri, "
df = df.withColumn("col_cleaned", regexp_replace("col", "^[A-Za-z]{3}, ", ""))

# Parse the remainder with a pattern that contains no week-based symbols
df = df.withColumn("col_parsed", to_timestamp("col_cleaned", "dd MMM yyyy HH:mm:ss 'GMT'"))
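The same cleanup can be illustrated in plain Python, outside Spark (a sketch only; Python's strptime uses % directives rather than Spark's pattern letters, and %b assumes an English locale for month abbreviations):

```python
import re
from datetime import datetime

raw = "Fri, 23 Aug 2024 12:11:16 GMT"

# Strip a leading three-letter day-of-week such as "Fri, "
cleaned = re.sub(r"^[A-Za-z]{3}, ", "", raw)

# Parse the remainder; "GMT" is matched as literal text,
# just like 'GMT' in the Spark pattern
parsed = datetime.strptime(cleaned, "%d %b %Y %H:%M:%S GMT")
print(parsed)  # 2024-08-23 12:11:16
```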

Upvotes: 1

Sachin Hosmani

Reputation: 1762

You need to pass Eee to match the case of the "Fri" in your example. But even then you will get an error like:

IllegalArgumentException: All week-based patterns are unsupported since Spark 3.0, detected: e, Please use the SQL function EXTRACT instead

So it seems this pattern is simply not supported. The only solution I could think of was to remove the day from the string, like so:

from pyspark.sql.functions import col, regexp_replace, to_timestamp

data = [("Fri, 23 Aug 2024 12:11:16 GMT",)]
df = spark.createDataFrame(data, ["datetime_string"])
df_parsed = df.withColumn(
    "parsed_datetime",
    to_timestamp(
        regexp_replace(col("datetime_string"), r'^[A-Za-z]+,\s', ''),
        "dd MMM yyyy HH:mm:ss 'GMT'"
    )
)

df_parsed.show(truncate=False)

+-----------------------------+-------------------+
|datetime_string              |parsed_datetime    |
+-----------------------------+-------------------+
|Fri, 23 Aug 2024 12:11:16 GMT|2024-08-23 12:11:16|
+-----------------------------+-------------------+

Upvotes: 1
