Reputation: 450
I want to convert an epoch field into a UTC timestamp, and I am using the code below.
df.withColumn(
  TargetColumn,
  to_timestamp(from_unixtime(col(SourceColumn) / 1000), "yyyy-MM-dd HH:mm:ss")
)
A sample value for the SourceColumn is 1580452395095L, which should yield 2020-01-31 06:33:15.0.
However, when I run this code it gives 2020-01-31 07:33:15.0.
It seems like the to_timestamp function uses the local time zone of the system. I even tried to set the Spark session time zone with
spark.conf.set("spark.sql.session.timeZone", "UTC")
but that didn't work either. Is there a way to make this code always convert to a UTC timestamp, regardless of the environment it runs on?
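For reference, a quick check with plain java.time (independent of Spark and of the machine's time zone) shows the raw value really corresponds to 06:33:15 UTC:
java.time.Instant.ofEpochMilli(1580452395095L)
// 2020-01-31T06:33:15.095Z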
Upvotes: 2
Views: 1079
Reputation: 8711
If you want to apply the time zone only for a specific DataFrame operation, irrespective of how the Spark session is created, this should help.
The default would be the JVM's system time zone:
java.time.ZoneId.systemDefault
res50: java.time.ZoneId = Asia/Calcutta
The same is reflected when you query the Spark configuration:
spark.sql("SET spark.sql.session.timeZone").show(false)
+--------------------------+-------------+
|key |value |
+--------------------------+-------------+
|spark.sql.session.timeZone|Asia/Calcutta|
+--------------------------+-------------+
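The same value can also be read through the runtime config API instead of SQL (an equivalent check, assuming the same session):
spark.conf.get("spark.sql.session.timeZone")
// String = Asia/Calcutta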
Now create the DataFrame:
val df = Seq((1580452395095L)).toDF("DATE")
Change it to Europe/London (which matches UTC on this date):
spark.conf.set("spark.sql.session.timeZone","Europe/London")
Querying the config setting now shows Europe/London:
spark.sql("SET spark.sql.session.timeZone").show(false)
+--------------------------+-------------+
|key |value |
+--------------------------+-------------+
|spark.sql.session.timeZone|Europe/London|
+--------------------------+-------------+
Result:
df.withColumn("NEW_DATE", to_timestamp(from_unixtime(col("DATE") / 1000))).show(false)
+-------------+-------------------+
|DATE |NEW_DATE |
+-------------+-------------------+
|1580452395095|2020-01-31 06:33:15|
+-------------+-------------------+
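As an aside, Europe/London only coincides with UTC outside of British Summer Time; if the goal is always UTC, it is arguably safer to set the zone to "UTC" itself, which produces the same output for this value:
spark.conf.set("spark.sql.session.timeZone", "UTC")
df.withColumn("NEW_DATE", to_timestamp(from_unixtime(col("DATE") / 1000))).show(false)
// |1580452395095|2020-01-31 06:33:15|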
Changing it back to the system default:
spark.conf.set("spark.sql.session.timeZone",java.time.ZoneId.systemDefault.toString)
df.withColumn("NEW_DATE", to_timestamp(from_unixtime(col("DATE") / 1000))).show(false)
+-------------+-------------------+
|DATE |NEW_DATE |
+-------------+-------------------+
|1580452395095|2020-01-31 12:03:15|
+-------------+-------------------+
spark.sql("SET spark.sql.session.timeZone").show(false)
+--------------------------+-------------+
|key |value |
+--------------------------+-------------+
|spark.sql.session.timeZone|Asia/Calcutta|
+--------------------------+-------------+
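If this set-and-restore pattern is needed often, it can be wrapped in a small helper so the previous zone is always restored even when the operation fails. This is a hypothetical utility sketch, not part of Spark's API:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Hypothetical helper: run a block under a temporary session time zone, then restore the previous one
def withSessionTimeZone[T](spark: SparkSession, zone: String)(body: => T): T = {
  val previous = spark.conf.get("spark.sql.session.timeZone")
  spark.conf.set("spark.sql.session.timeZone", zone)
  try body
  finally spark.conf.set("spark.sql.session.timeZone", previous)
}

// Only this show() runs with UTC semantics; the session zone reverts afterwards
withSessionTimeZone(spark, "UTC") {
  df.withColumn("NEW_DATE", to_timestamp(from_unixtime(col("DATE") / 1000))).show(false)
}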
Upvotes: 2
Reputation: 23099
Setting the time zone in the config should fix the issue. Setting "spark.sql.session.timeZone" to "UTC"
anywhere in the config should work.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder()
  .appName("test")
  .master("local")
  .config("spark.sql.session.timeZone", "UTC")
  .getOrCreate()

import spark.implicits._

val df = Seq(
  (1580452395095L)
).toDF("DATE")

df.withColumn("NEW_DATE",
    to_timestamp(from_unixtime(col("DATE") / 1000), "yyyy-MM-dd HH:mm:ss"))
  .show(false)
Output:
+-------------+-------------------+
|DATE |NEW_DATE |
+-------------+-------------------+
|1580452395095|2020-01-31 06:33:15|
+-------------+-------------------+
Without setting the timeZone, we get
+-------------+-------------------+
|DATE |NEW_DATE |
+-------------+-------------------+
|1580452395095|2020-01-31 07:33:15|
+-------------+-------------------+
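As a small side note, the explicit "yyyy-MM-dd HH:mm:ss" pattern is optional here, since from_unixtime already emits that format by default; the shorter form gives the same result:
df.withColumn("NEW_DATE", to_timestamp(from_unixtime(col("DATE") / 1000))).show(false)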
Upvotes: 3
Reputation: 42332
You can try appending the following line to your $SPARK_HOME/conf/spark-defaults.conf file:
spark.sql.session.timeZone UTC
If necessary, you can also set the JVM time zones using:
spark.driver.extraJavaOptions -Duser.timezone=GMT
spark.executor.extraJavaOptions -Duser.timezone=GMT
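Alternatively, the same settings can be passed per job on the command line, assuming you launch with spark-submit (your-app.jar is just a placeholder):
spark-submit \
  --conf spark.sql.session.timeZone=UTC \
  --conf "spark.driver.extraJavaOptions=-Duser.timezone=GMT" \
  --conf "spark.executor.extraJavaOptions=-Duser.timezone=GMT" \
  your-app.jar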
Upvotes: 1