noobie-php

Reputation: 7233

to_date fails to parse date in Spark 3.0

I am trying to parse date using to_date() but I get the following exception.

SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: Fail to parse '12/1/2010 8:26' in the new parser. You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0, or set to CORRECTED and treat it as an invalid datetime string.

The exception suggests I should use the legacy time parser; for starters, I don't know how to set it to LEGACY.

Here is my implementation

dfWithDate = df.withColumn("date", to_date(col("InvoiceDate"), "MM/dd/yyyy"))

my date is in following format

+--------------+
|   InvoiceDate|
+--------------+
|12/1/2010 8:26|
|12/1/2010 8:26|
|12/1/2010 8:26|
|12/1/2010 8:26|
|12/1/2010 8:26|
+--------------+

Upvotes: 33

Views: 66734

Answers (7)

Sri Hari Adabala

Reputation: 1

Instead of using the legacy parser, you could also update the date format from MM/dd/yyyy to M-d-yy. It worked for me.

Upvotes: 0

RITA KUSHWAHA

Reputation: 496

One can use:

from pyspark.sql.functions import to_date, unix_timestamp

data = data.withColumn("Date", to_date(unix_timestamp("InvoiceDate", "MM/dd/yyyy").cast("timestamp")))

#result
+--------------+----------+
|   InvoiceDate|      Date|
+--------------+----------+
|12/1/2010 8:26|2010-12-01|
+--------------+----------+
This works for me.

Upvotes: 1

xneg

Reputation: 1348

According to this, in Spark 3 you should use the pattern "M/d/y". It works for me.
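A minimal sketch of that suggestion, assuming the question's df and its InvoiceDate column:

from pyspark.sql import functions as F

# single-letter pattern fields let the Spark 3 parser accept 1- or 2-digit month/day values
dfWithDate = df.withColumn("date", F.to_date(F.col("InvoiceDate"), "M/d/y"))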

Upvotes: 13

blackbishop

Reputation: 32660

You can keep using the new Spark 3 implementation by parsing the string into a timestamp first, then casting it into a date:

from pyspark.sql import functions as F

dfWithDate = df.withColumn("date", F.to_date(F.to_timestamp(F.col("InvoiceDate"), "M/d/yyyy H:mm")))

dfWithDate.show()
#+--------------+----------+
#|   InvoiceDate|      date|
#+--------------+----------+
#| 2/1/2010 8:26|2010-02-01|
#| 2/1/2010 8:26|2010-02-01|
#| 2/1/2010 8:26|2010-02-01|
#| 2/1/2010 8:26|2010-02-01|
#|12/1/2010 8:26|2010-12-01|
#+--------------+----------+

Upvotes: 14

Felipe

Reputation: 7563

In case you want to keep using the Spark 3.0 behavior (not the legacy time parser), you can just use a single d in "MM/d/yyyy":

dfWithDate = df.withColumn("date", to_date(col("InvoiceDate"), "MM/d/yyyy"))

Upvotes: 5

Kay

Reputation: 31

Instead of using the legacy parser, you could also update the date format from MM/dd/yyyy to MM-dd-yyyy.

This is not a solution, as it returns NULL values.

Upvotes: 3

suresiva

Reputation: 3173

spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY")
df.withColumn("date", to_date(col("InvoiceDate"), "MM/dd/yyyy")).show()


+--------------+----------+
|   InvoiceDate|      date|
+--------------+----------+
|12/1/2010 8:26|2010-12-01|
+--------------+----------+

# in the above code, spark refers to the SparkSession
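The same policy can also be set through the SparkSession's runtime config API, which is equivalent to the SQL SET statement above:

spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")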

Upvotes: 38
