NatalyB

Reputation: 1

Conversion incompatibility between timestamp type in Glue and in Spark?

I want to run a simple SQL select of timestamp fields from my data using Spark SQL (PySpark). However, all the timestamp fields appear as 1970-01-19 10:45:37.009, so it looks like I have some conversion incompatibility between the timestamp type in Glue and in Spark.

I'm running with PySpark and have the Glue catalog configured, so I get my database schema from Glue. In both Glue and the Spark SQL DataFrame these columns appear with timestamp type.

However, when I read the Parquet files directly from the S3 path, the event_time column (for example) is of type long, and its data is a correct epoch in milliseconds, e.g. 1593938489000, so I can convert it and get the actual datetime.
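
For illustration, this is roughly what that manual conversion looks like (the S3 path is just a placeholder):

from pyspark.sql import functions as F

# Placeholder path; in the raw files event_time is a long holding epoch milliseconds.
raw = spark.read.parquet("s3://my-bucket/path/to/data/")
raw = raw.withColumn("event_time_converted",
                     F.from_unixtime((F.col("event_time") / 1000).cast("long")))
raw.select("event_time", "event_time_converted").show(5, truncate=False)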

But when I run spark.sql, the event_time column comes back with timestamp type, yet the value is useless: I get 1970-01-19 10:45:37.009. When I run the same SQL query in Athena, the timestamp field looks fine, so my schema in Glue seems correct.

Is there a way to overcome this? I didn't manage to find any spark.sql configuration that solves it.

Upvotes: 0

Views: 1096

Answers (1)

dsk

Reputation: 2003

You are getting 1970 because the value is being formatted incorrectly. Please try the code below to convert the long (epoch milliseconds) to a UTC timestamp:

from pyspark.sql import types as T
from pyspark.sql import functions as F

# Sample column holding the epoch value in milliseconds (as in the question).
df = df.withColumn('timestamp_col_original', F.lit('1593938489000'))
# Divide by 1000 to get seconds, then cast to a proper timestamp.
df = df.withColumn('timestamp_col', (F.col('timestamp_col_original') / 1000).cast(T.TimestampType()))
df.show()

While converting 1593938489000, I was getting the output below:

+----------------------+-------------------+
|timestamp_col_original|      timestamp_col|
+----------------------+-------------------+
|         1593938489000|2020-07-05 08:41:29|
|         1593938489000|2020-07-05 08:41:29|
|         1593938489000|2020-07-05 08:41:29|
|         1593938489000|2020-07-05 08:41:29|
+----------------------+-------------------+
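
The same pattern works on a real column rather than a literal, as long as the column still holds the raw epoch milliseconds (for example when the Parquet files are read directly instead of through the catalog); the column name below is just an example:

# Any long column of epoch milliseconds converts the same way:
# divide by 1000 to get seconds, then cast to timestamp.
df = df.withColumn('event_time_ts', (F.col('event_time') / 1000).cast(T.TimestampType()))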

Upvotes: 1
