Reputation: 462
I am having trouble extracting the time and date from a timestamp. The data is read as a string, so I saw that people are able to use date_format
to essentially convert it into the format we want. Here is what I tried below
import pyspark.sql.functions as F
from pyspark.sql.functions import date_format
data = df_data.select(date_format(F.col("timestamp"),"MM/d/yy").alias("date"),
date_format(F.col("timestamp"),"HH:mm").alias("time"),
date_format(F.col("timestamp"), "M/d/yy").alias("current_date_formated")).show(10)
My input data looks like this:
timestamp 11/9/18 14:11 11/9/18 14:27 11/9/18 14:42 11/9/18 14:57
The output of my code above is spewed as under my date
and time
alias returns rows of null
. I have also tried using to_date
but didn't work either
import pyspark.sql.functions as F
data = df_data.select(date_format(F.col("timestamp"),"MM/d/yy").alias("date"),
F.to_date(F.col("timestamp"),"HH:mm").alias("time"),
F.to_date(F.col("timestamp"), "M/d/yy").alias("current_date_formated")).show(10)
Upvotes: 1
Views: 572
Reputation: 8410
You have the right idea with date_format
. The only thing your missing is it to first and foremost convert your string timestamp to Timestamp Type. And then you can do your formatting. I assumed your input data
is in the format of day-month-year, if not, you can switch it to "MM/dd/yy HH:mm"
.
from pyspark.sql import functions as F
df.withColumn("timestamp", F.to_timestamp("timestamp","dd/MM/yy HH:mm"))\
.withColumn("time",F.date_format("timestamp","HH:mm"))\
.withColumn("current_date_formated",F.date_format("timestamp", "M/d/yy")).show()
+-------------------+-----+---------------------+
| timestamp| time|current_date_formated|
+-------------------+-----+---------------------+
|2018-09-11 14:11:00|14:11| 9/11/18|
|2018-09-11 14:27:00|14:27| 9/11/18|
|2018-09-11 14:42:00|14:42| 9/11/18|
|2018-09-11 14:57:00|14:57| 9/11/18|
+-------------------+-----+---------------------+
Upvotes: 1