Reputation: 139
I have a DataFrame with two columns, "date" (dtype: string) and "modified" (dtype: bigint), as shown below:
+-------------------------------------+-------------+
| date| modified|
+-------------------------------------+-------------+
|Mon, 18 Dec 2017 22:52:37 +0000 (UTC)|1513637587000|
| Mon, 18 Dec 2017 22:52:23 +0000|1513637587000|
| Mon, 18 Dec 2017 22:52:03 +0000|1513637587000|
|Mon, 18 Dec 2017 22:51:43 +0000 (UTC)|1513637527000|
| Mon, 18 Dec 2017 22:51:31 +0000|1513637527000|
| Mon, 18 Dec 2017 22:51:38 +0000|1513637527000|
| Mon, 18 Dec 2017 22:51:09 +0000|1513637526000|
| Mon, 18 Dec 2017 22:50:55 +0000|1513637466000|
| Mon, 18 Dec 2017 22:50:35 +0000|1513637466000|
| Mon, 18 Dec 2017 17:49:35 -0500|1513637407000|
+-------------------------------------+-------------+
How can I extract the date in yyyy-MM-dd format (2017-12-18)
from either of the two columns? I tried using unix_timestamp
and to_timestamp,
but neither worked; both give a null
value.
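For reference, a minimal sketch of what such a failing attempt presumably looks like (the exact code isn't in the question, and the dataframe name df is assumed): unix_timestamp falls back to its default pattern yyyy-MM-dd HH:mm:ss, which does not match strings like "Mon, 18 Dec 2017 22:52:37 +0000 (UTC)", and to_timestamp without a format only accepts ISO-8601-like strings, so both return null.
import pyspark.sql.functions as F

# hypothetical reproduction of the failing attempt (not the actual code from the question)
df.select(
    F.unix_timestamp("date").alias("as_unix"),  # null: default pattern does not match
    F.to_timestamp("date").alias("as_ts")       # null: string is not ISO-8601
).show(3, truncate=False)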
Upvotes: 3
Views: 4600
Reputation: 32720
Many questions have already been posted here on how to convert strings to dates in Spark (Convert pyspark string to date format, Convert date from String to Date format in Dataframes, ...).
You are getting null because the modified
column is epoch time in milliseconds; you need to divide it by 1000 to get seconds before converting it to a timestamp:
from pyspark.sql import functions as F

df1 = df.withColumn(
    # epoch milliseconds -> seconds, then timestamp, then date
    "modified_as_date",
    F.to_timestamp(F.col("modified") / 1000).cast("date")
).withColumn(
    # parse the RFC-2822-style string with an explicit pattern
    "date_as_date",
    F.to_date("date", "EEE, dd MMM yyyy HH:mm:ss")
)
df1.show(truncate=False)
#+-------------------------------------+-------------+----------------+------------+
#|date |modified |modified_as_date|date_as_date|
#+-------------------------------------+-------------+----------------+------------+
#|Mon, 18 Dec 2017 22:52:37 +0000 (UTC)|1513637587000|2017-12-18 |2017-12-18 |
#|Mon, 18 Dec 2017 22:52:23 +0000 |1513637587000|2017-12-18 |2017-12-18 |
#|Mon, 18 Dec 2017 22:52:03 +0000 |1513637587000|2017-12-18 |2017-12-18 |
#|Mon, 18 Dec 2017 22:51:43 +0000 (UTC)|1513637527000|2017-12-18 |2017-12-18 |
#|Mon, 18 Dec 2017 22:51:31 +0000 |1513637527000|2017-12-18 |2017-12-18 |
#|Mon, 18 Dec 2017 22:51:38 +0000 |1513637527000|2017-12-18 |2017-12-18 |
#|Mon, 18 Dec 2017 22:51:09 +0000 |1513637526000|2017-12-18 |2017-12-18 |
#|Mon, 18 Dec 2017 22:50:55 +0000 |1513637466000|2017-12-18 |2017-12-18 |
#|Mon, 18 Dec 2017 22:50:35 +0000 |1513637466000|2017-12-18 |2017-12-18 |
#|Mon, 18 Dec 2017 17:49:35 -0500 |1513637407000|2017-12-18 |2017-12-18 |
#+-------------------------------------+-------------+----------------+------------+
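As a quick sanity check outside Spark (plain Python, just to illustrate the milliseconds-versus-seconds point; not part of the Spark job), dividing the epoch value by 1000 before converting gives the expected day:
from datetime import datetime, timezone

# 1513637587000 is epoch time in milliseconds; divide by 1000 to get seconds
datetime.fromtimestamp(1513637587000 / 1000, tz=timezone.utc).date()
# -> datetime.date(2017, 12, 18)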
Upvotes: 1
Reputation: 42422
You can use from_unixtime
to convert the bigint Unix timestamp (divided by 1000, since it is in milliseconds) into a formatted timestamp string, and then cast that to the date type:
import pyspark.sql.functions as F

# milliseconds -> seconds, convert with from_unixtime, then cast to date
df2 = df.withColumn('parsed_date', F.from_unixtime(F.col('modified') / 1000).cast('date'))
df2.show()
+--------------------+-------------+-----------+
| date| modified|parsed_date|
+--------------------+-------------+-----------+
|Mon, 18 Dec 2017 ...|1513637587000| 2017-12-18|
|Mon, 18 Dec 2017 ...|1513637587000| 2017-12-18|
|Mon, 18 Dec 2017 ...|1513637587000| 2017-12-18|
|Mon, 18 Dec 2017 ...|1513637527000| 2017-12-18|
|Mon, 18 Dec 2017 ...|1513637527000| 2017-12-18|
|Mon, 18 Dec 2017 ...|1513637527000| 2017-12-18|
|Mon, 18 Dec 2017 ...|1513637526000| 2017-12-18|
|Mon, 18 Dec 2017 ...|1513637466000| 2017-12-18|
|Mon, 18 Dec 2017 ...|1513637466000| 2017-12-18|
|Mon, 18 Dec 2017 ...|1513637407000| 2017-12-18|
+--------------------+-------------+-----------+
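If a yyyy-MM-dd string is needed rather than a date column, date_format can be applied on top of the result (a small follow-up sketch reusing df2 from above; the column name parsed_date_str is just illustrative):
# format the parsed date back into a 'yyyy-MM-dd' string
df3 = df2.withColumn('parsed_date_str', F.date_format('parsed_date', 'yyyy-MM-dd'))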
Upvotes: 3