Reputation: 1
I want to run a simple sql select of timestamp fields from my data using spark sql (pyspark). However, all the timestamp fields appear as 1970-01-19 10:45:37.009 . So looks like I have some conversion incompatibility between timestamp in Glue and in Spark.
I'm running with pyspark, and I have the glue catalog configuration so I get my database schema from Glue. In both Glue and the spark sql dataframe these columns appear with timestamp type.
However, it looks like when I read the parquet files from s3 path, the event_time column (for example) is of type long and when I get its data, I get a correct event_time as epoch in milliseconds = 1593938489000. So I can convert it and get the actual datetime.
But when I run spark.sql , the event_time column gets timestamp type but it isn’t useful and missing precision. So I get this = 1970-01-19 10:45:37.009 . When I run the same sql query in Athena, the timestamp field looks fine so my schema in Glue looks correct.
Is there a way to overcome it? I didn't manage to find any spark.sql configurations that solved it.
Upvotes: 0
Views: 1096
Reputation: 2003
You are getting 1970, due to incorrect way of formatting. Please give a try below code to convert long to UTC timestamp
from pyspark.sql import types as T
from pyspark.sql import functions as F
df = df.withColumn('timestamp_col_original', F.lit('1593938489000'))
df = df.withColumn('timestamp_col', (F.col('timestamp_col_original') / 1000).cast(T.TimestampType()))
df.show()
While converting : 1593938489000 I was getting below
timestamp_col_original| timestamp_col|
+----------------------+-------------------+
| 1593938489000|2020-07-05 08:41:29|
| 1593938489000|2020-07-05 08:41:29|
| 1593938489000|2020-07-05 08:41:29|
| 1593938489000|2020-07-05 08:41:29|
+----------------------+-------------------+
Upvotes: 1