Reputation: 581
I'm using the PySpark library to read JSON files, process the data, and write back to parquet files.
The incoming data has a date field measured as milliseconds since the epoch. E.g., 1541106106796 represents Thursday, November 1, 2018 9:01:46.796 PM.
A valid solution uses the Python datetime library:
from datetime import datetime
from pyspark.sql.functions import udf
from pyspark.sql.types import TimestampType

def format_datetime(ts):
    return datetime.fromtimestamp(ts / 1000.0)
...
get_timestamp = udf(lambda x: format_datetime(int(x)), TimestampType())
df = df.withColumn("timestamp", get_timestamp(df.ts))
Is there a solution that only uses native Spark functions?
Upvotes: 1
Views: 463
Reputation: 31460
Use from_unixtime to convert the seconds part, extract the milliseconds from the timestamp and append them at the end, and finally cast to timestamp type.
df.show()
#+-------------+
#| ts|
#+-------------+
#|1541106106796|
#+-------------+
df.withColumn("ts1",expr('concat_ws(".",from_unixtime(substring(ts,1,length(ts)-3),"yyyy-MM-dd HH:mm:ss"),substring(ts,length(ts)-2,length(ts)))').cast("timestamp")).\
show(10,False)
#+-------------+-----------------------+
#|ts |ts1 |
#+-------------+-----------------------+
#|1541106106796|2018-11-01 16:01:46.796|
#+-------------+-----------------------+
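If your Spark version interprets a double cast to timestamp as seconds since the epoch (the standard behaviour in recent releases, but worth verifying on yours), a shorter native variant is to divide by 1000 and cast; a minimal sketch:
from pyspark.sql.functions import col

# Dividing the epoch-millisecond string by 1000 coerces it to a double
# (seconds with a fractional part); casting that double to timestamp keeps
# the millisecond component.
df.withColumn("ts1", (col("ts") / 1000).cast("timestamp")).show(10, False)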
To go the other way and build the epoch-millisecond value from the formatted string, use the unix_timestamp and regexp_extract functions.
Example:
df.show(10,False)
#+-----------------------------------------+
#|sample |
#+-----------------------------------------+
#|Thursday, November 1, 2018 9:01:46.796 PM|
#+-----------------------------------------+
df.withColumn("ts",concat_ws('',unix_timestamp(col("sample"),"E, MMMM d, yyyy hh:mm:ss.SSS a"),regexp_extract(col("sample"),"\\.(.*)\\s+",1))).\
show(10,False)
#+-----------------------------------------+-------------+
#|sample |ts |
#+-----------------------------------------+-------------+
#|Thursday, November 1, 2018 9:01:46.796 PM|1541124106796|
#+-----------------------------------------+-------------+
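Putting this back into the context of the question (read JSON, process, write parquet), the from_unixtime expression can be dropped straight into that pipeline; a minimal sketch, with hypothetical paths in_path and out_path:
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder.getOrCreate()

# Hypothetical locations, only for illustration.
in_path = "/data/events/*.json"
out_path = "/data/events_parquet"

df = spark.read.json(in_path)

# Same expression as above: strip the last three digits for the seconds part,
# re-attach them as the fractional part, then cast to timestamp.
df = df.withColumn(
    "timestamp",
    expr('concat_ws(".",'
         'from_unixtime(substring(ts,1,length(ts)-3),"yyyy-MM-dd HH:mm:ss"),'
         'substring(ts,length(ts)-2,length(ts)))').cast("timestamp"))

df.write.mode("overwrite").parquet(out_path)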
Upvotes: 1