Reputation: 627
I have a df with a column containing epoch time. The type of the epoch time column is string, and I want to convert it into a timestamp. I am using the following command:
from pyspark.sql.functions import from_unixtime
df = df.withColumn('start_time', from_unixtime(df.recognition_start_time, 'UTC'))
df.select('start_time').show(10, False)
But the command gives me null as a result. What mistake am I making?
Epoch time: 1583471040000
Expected output: 2020-03-06 05:04:00
Upvotes: 3
Views: 3335
Reputation: 2913
Instead of using a timestamp stored as a StringType(), I recommend casting it directly to TimestampType() in PySpark.
from pyspark.sql.functions import col, lit
from pyspark.sql.types import TimestampType

# Ensure the session time zone is set to UTC on your cluster
spark.conf.set("spark.sql.session.timeZone", "UTC")

df = df.withColumn(
    'start_time',
    # since your epoch timestamp is in milliseconds, divide by 1000 before casting
    (col("recognition_start_time") / lit(1000)).cast(TimestampType())
)
df.select('start_time').show(10, False)
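For reference, a minimal self-contained sketch of the same cast, assuming a hypothetical one-row DataFrame built from the epoch value in the question:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import TimestampType

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.session.timeZone", "UTC")

# hypothetical one-row DataFrame holding the epoch string from the question
df = spark.createDataFrame([("1583471040000",)], ["recognition_start_time"])

df = df.withColumn(
    "start_time",
    # the string column is implicitly cast to a number for the division
    (col("recognition_start_time") / 1000).cast(TimestampType())
)
df.select("start_time").show(truncate=False)
# expected: 2020-03-06 05:04:00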
Upvotes: 0
Reputation: 31460
In from_unixtime we need to specify the expected output format (yyyy-MM-dd hh:mm:ss), and by using from_utc_timestamp we can convert the output from UTC to a given timezone.
Example:
spark.sql("""select from_utc_timestamp(from_unixtime("1583471040000"/1000,"yyyy-MM-dd hh:mm:ss"),"America/Chicago")""").show(truncate=False)
+---------------------+
|_c0 |
+---------------------+
|2020-03-05 05:04:00.0|
+---------------------+
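The same conversion can also be written with the DataFrame API against the column from the question; a sketch, using the 24-hour HH pattern instead of hh and the America/Chicago timezone from the example above:

from pyspark.sql.functions import col, from_unixtime, from_utc_timestamp

# assumes df still has the string column recognition_start_time from the question
df = df.withColumn(
    "start_time",
    from_utc_timestamp(
        from_unixtime(col("recognition_start_time") / 1000, "yyyy-MM-dd HH:mm:ss"),
        "America/Chicago"
    )
)
df.select("recognition_start_time", "start_time").show(truncate=False)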
Upvotes: 2