Samiksha

Reputation: 139

Convert datetime to date in PySpark

I have a dataframe with two columns "date" (dtype: string) and "modified" (dtype: bigint) as shown below:

+-------------------------------------+-------------+
|                                 date|     modified|
+-------------------------------------+-------------+
|Mon, 18 Dec 2017 22:52:37 +0000 (UTC)|1513637587000|
|      Mon, 18 Dec 2017 22:52:23 +0000|1513637587000|
|      Mon, 18 Dec 2017 22:52:03 +0000|1513637587000|
|Mon, 18 Dec 2017 22:51:43 +0000 (UTC)|1513637527000|
|      Mon, 18 Dec 2017 22:51:31 +0000|1513637527000|
|      Mon, 18 Dec 2017 22:51:38 +0000|1513637527000|
|      Mon, 18 Dec 2017 22:51:09 +0000|1513637526000|
|      Mon, 18 Dec 2017 22:50:55 +0000|1513637466000|
|      Mon, 18 Dec 2017 22:50:35 +0000|1513637466000|
|      Mon, 18 Dec 2017 17:49:35 -0500|1513637407000|
+-------------------------------------+-------------+

How can I extract the date as yyyy-MM-dd (2017-12-18) from either of the two columns? I tried unix_timestamp and to_timestamp, but both return null.

Upvotes: 3

Views: 4600

Answers (2)

blackbishop

Reputation: 32720

Many questions have been posted here on how to convert strings to dates in Spark (Convert pyspark string to date format, Convert date from String to Date format in Dataframes, ...).

You are getting null because the modified column is epoch time in milliseconds; you need to divide it by 1000 to get seconds before converting it to a timestamp:

from pyspark.sql import functions as F

# modified holds epoch milliseconds: divide by 1000 to get seconds,
# which to_timestamp converts directly; casting drops the time part
df1 = df.withColumn(
    "modified_as_date",
    F.to_timestamp(F.col("modified") / 1000).cast("date")
).withColumn(
    # parse the string column up to the seconds; the offset and "(UTC)"
    # suffix are left unmatched (tolerated by the legacy parser)
    "date_as_date",
    F.to_date("date", "EEE, dd MMM yyyy HH:mm:ss")
)

df1.show(truncate=False)

#+-------------------------------------+-------------+----------------+------------+
#|date                                 |modified     |modified_as_date|date_as_date|
#+-------------------------------------+-------------+----------------+------------+
#|Mon, 18 Dec 2017 22:52:37 +0000 (UTC)|1513637587000|2017-12-18      |2017-12-18  |
#|Mon, 18 Dec 2017 22:52:23 +0000      |1513637587000|2017-12-18      |2017-12-18  |
#|Mon, 18 Dec 2017 22:52:03 +0000      |1513637587000|2017-12-18      |2017-12-18  |
#|Mon, 18 Dec 2017 22:51:43 +0000 (UTC)|1513637527000|2017-12-18      |2017-12-18  |
#|Mon, 18 Dec 2017 22:51:31 +0000      |1513637527000|2017-12-18      |2017-12-18  |
#|Mon, 18 Dec 2017 22:51:38 +0000      |1513637527000|2017-12-18      |2017-12-18  |
#|Mon, 18 Dec 2017 22:51:09 +0000      |1513637526000|2017-12-18      |2017-12-18  |
#|Mon, 18 Dec 2017 22:50:55 +0000      |1513637466000|2017-12-18      |2017-12-18  |
#|Mon, 18 Dec 2017 22:50:35 +0000      |1513637466000|2017-12-18      |2017-12-18  |
#|Mon, 18 Dec 2017 17:49:35 -0500      |1513637407000|2017-12-18      |2017-12-18  |
#+-------------------------------------+-------------+----------------+------------+
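Note that the EEE, dd MMM yyyy HH:mm:ss pattern relies on the legacy parser tolerating the unmatched offset and "(UTC)" suffix; on Spark 3.x with the default time parser policy, the EEE symbol is rejected for parsing, so you would need to set spark.sql.legacy.timeParserPolicy=LEGACY or reshape the input first. A minimal Spark 3-friendly sketch, assuming rows shaped like the sample above (the regexp and the df3 name are illustrative):

from pyspark.sql import functions as F

# strip the leading day name ("Mon, ") and the optional " (UTC)" suffix,
# then parse the remainder with an offset-aware pattern
df3 = df.withColumn(
    "date_as_date",
    F.to_date(
        F.regexp_replace("date", r"^\w+, | \(UTC\)$", ""),
        "dd MMM yyyy HH:mm:ss Z"
    )
)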

Upvotes: 1

mck

Reputation: 42422

You can use from_unixtime to convert the bigint epoch value to a timestamp string (dividing by 1000 first, since from_unixtime expects seconds), and then cast to date type:

import pyspark.sql.functions as F

# from_unixtime expects seconds, so divide the millisecond epoch by 1000;
# it returns a 'yyyy-MM-dd HH:mm:ss' string, which casts cleanly to date
df2 = df.withColumn('parsed_date', F.from_unixtime(F.col('modified')/1000).cast('date'))

df2.show()
+--------------------+-------------+-----------+
|                date|     modified|parsed_date|
+--------------------+-------------+-----------+
|Mon, 18 Dec 2017 ...|1513637587000| 2017-12-18|
|Mon, 18 Dec 2017 ...|1513637587000| 2017-12-18|
|Mon, 18 Dec 2017 ...|1513637587000| 2017-12-18|
|Mon, 18 Dec 2017 ...|1513637527000| 2017-12-18|
|Mon, 18 Dec 2017 ...|1513637527000| 2017-12-18|
|Mon, 18 Dec 2017 ...|1513637527000| 2017-12-18|
|Mon, 18 Dec 2017 ...|1513637526000| 2017-12-18|
|Mon, 18 Dec 2017 ...|1513637466000| 2017-12-18|
|Mon, 18 Dec 2017 ...|1513637466000| 2017-12-18|
|Mon, 18 Dec 2017 ...|1513637407000| 2017-12-18|
+--------------------+-------------+-----------+
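If you are on Spark 3.1 or later, the built-in SQL function timestamp_millis interprets a bigint directly as milliseconds since the epoch, so you can skip the division. A minimal sketch (the df3 name is illustrative):

import pyspark.sql.functions as F

# timestamp_millis (Spark >= 3.1) treats the bigint as epoch milliseconds;
# casting the resulting timestamp to date keeps only the yyyy-MM-dd part
df3 = df.withColumn('parsed_date', F.expr('timestamp_millis(modified)').cast('date'))

If the goal is literally the string 2017-12-18 rather than a DateType column, wrapping either result in F.date_format(..., 'yyyy-MM-dd') produces it.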

Upvotes: 3
