merkle
merkle

Reputation: 1815

conversion of string timestamp column to date datatype returning null values

I have a string date column in the pyspark dataframe as shown below.

df1=spark.createDataFrame(
data = [ ("8/30/2019 12:00:00 AM"),("8/29/2019 12:00:00 AM"),("8/27/2019 12:00:00 AM")], schema=["DateJoined"])

+---------------------+
|DateJoined           |
+---------------------+
|8/30/2019 12:00:00 AM|
|8/29/2019 12:00:00 AM|
|8/27/2019 12:00:00 AM

I am trying to convert it to date type using the below code but returning null

df = df.withColumn('date_col',
F.udf(lambda d: datetime.strptime(d, '%m/%d/%Y %I:%M %p').strftime('%m/%d/%Y %H:%M:%S'), T.StringType())(F.col('DateJoined')))

Can anyone help me with this?

Upvotes: 0

Views: 586

Answers (2)

Vaebhav
Vaebhav

Reputation: 5052

You can also utilise the native to_date function , which supports multiple parsings - https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html

Data Preparation

df = pd.DataFrame({
    'DateJoined' : ["8/30/2019 12:00:00 AM","8/29/2019 12:00:00 AM","8/27/2019 12:00:00 AM"]
})

sparkDF = sql.createDataFrame(df)

sparkDF.show()

+---------------------+
|DateJoined           |
+---------------------+
|8/30/2019 12:00:00 AM|
|8/29/2019 12:00:00 AM|
|8/27/2019 12:00:00 AM|
+---------------------+

To Date

sparkDF = sparkDF.withColumn('DateJoined_parsed',F.to_date(F.col("DateJoined"),"M/dd/yyyy hh:mm:ss a"))

sparkDF.show(truncate=False)

+---------------------+-----------------+
|DateJoined           |DateJoined_parsed|
+---------------------+-----------------+
|8/30/2019 12:00:00 AM|2019-08-30       |
|8/29/2019 12:00:00 AM|2019-08-29       |
|8/27/2019 12:00:00 AM|2019-08-27       |
+---------------------+-----------------+

Upvotes: 0

wwnde
wwnde

Reputation: 26676

I wouldn't use a udf

  df=spark.createDataFrame([(1,'8/30/2019 12:00:00 AM'),
    (2,'8/30/2019 12:00:00 AM'),
    (3,'8/30/2019 12:00:00 AM')],
    ('id','DateJoined'))
    df.show(truncate=False)

    spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY")

df.withColumn("DateJoined",F.from_unixtime(F.unix_timestamp("DateJoined",'MM/dd/yyyy hh:mm:ss aa'),'MM/dd/yyyy HH:mm:ss')).show()

or

df.withColumn("DateJoined",F.from_unixtime(F.unix_timestamp("DateJoined",'MM/dd/yyyy hh:mm:ss aa'),'MM/dd/yyyy HH:mm:ss aa')).show(truncate=False)

+---+-------------------+
| id|         DateJoined|
+---+-------------------+
|  1|08/30/2019 00:00:00|
|  2|08/30/2019 00:00:00|
|  3|08/30/2019 00:00:00|
+---+-------------------+

Upvotes: 1

Related Questions