Mo T

Reputation: 450

casting UTC unixtime to timestamp in the same timezone

I want to convert an epoch field into a UTC timestamp, and I am using the code below.

df.withColumn(
  TargetColumn,
  to_timestamp(from_unixtime(col(SourceColumn) / 1000), "yyyy-MM-dd HH:mm:ss")
)

A sample value for SourceColumn is 1580452395095L, which should yield 2020-01-31 06:33:15.0; however, when I run this code it gives 2020-01-31 07:33:15.0.
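
For reference, the expected UTC rendering of that sample value can be checked with plain java.time; this is only a sketch, independent of Spark and of any timezone configuration:

import java.time.{Instant, ZoneOffset}
import java.time.format.DateTimeFormatter

// Render the sample epoch-millis value explicitly in UTC.
val expected = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
  .withZone(ZoneOffset.UTC)
  .format(Instant.ofEpochMilli(1580452395095L))
// expected: String = 2020-01-31 06:33:15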

It seems like the to_timestamp function uses the local timezone of the system. I even tried setting the Spark timezone with

spark.conf.set("spark.sql.session.timeZone", "UTC")

But this also didn't work. Is there a way to make this code always convert to a UTC timestamp, regardless of the environment it runs on?

Upvotes: 2

Views: 1079

Answers (3)

stack0114106

Reputation: 8711

If you want to enable it only for a specific DataFrame operation, irrespective of how the Spark session is created, this should help.

By default it is the JVM's system timezone:

java.time.ZoneId.systemDefault

res50: java.time.ZoneId = Asia/Calcutta

The same is reflected when you query the Spark configuration:

spark.sql("SET spark.sql.session.timeZone").show(false)

+--------------------------+-------------+
|key                       |value        |
+--------------------------+-------------+
|spark.sql.session.timeZone|Asia/Calcutta|
+--------------------------+-------------+

Now create the dataframe:

val df = Seq((1580452395095L)).toDF("DATE")

Change it to Europe/London (which is on GMT/UTC for this January date):

spark.conf.set("spark.sql.session.timeZone","Europe/London")

Querying the config settings now shows London:

spark.sql("SET spark.sql.session.timeZone").show(false)

+--------------------------+-------------+
|key                       |value        |
+--------------------------+-------------+
|spark.sql.session.timeZone|Europe/London|
+--------------------------+-------------+

Result:

df.withColumn("NEW_DATE", to_timestamp(from_unixtime(col("DATE") / 1000))).show(false)

+-------------+-------------------+
|DATE         |NEW_DATE           |
+-------------+-------------------+
|1580452395095|2020-01-31 06:33:15|
+-------------+-------------------+

Changing it back to the system default:

spark.conf.set("spark.sql.session.timeZone",java.time.ZoneId.systemDefault.toString)

df.withColumn("NEW_DATE", to_timestamp(from_unixtime(col("DATE") / 1000))).show(false)

+-------------+-------------------+
|DATE         |NEW_DATE           |
+-------------+-------------------+
|1580452395095|2020-01-31 12:03:15|
+-------------+-------------------+

spark.sql("SET spark.sql.session.timeZone").show(false)

+--------------------------+-------------+
|key                       |value        |
+--------------------------+-------------+
|spark.sql.session.timeZone|Asia/Calcutta|
+--------------------------+-------------+
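
A small refinement of the same idea, shown only as a sketch: wrap the temporary timezone switch in a helper so the previous session setting is always restored. The helper name is illustrative and not part of the original answer; it assumes the spark session and df from above are in scope.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_unixtime, to_timestamp}

// Run a block with the session timezone temporarily set to `tz`,
// then restore whatever was configured before.
def withSessionTimeZone[T](spark: SparkSession, tz: String)(body: => T): T = {
  val previous = spark.conf.get("spark.sql.session.timeZone")
  spark.conf.set("spark.sql.session.timeZone", tz)
  try body
  finally spark.conf.set("spark.sql.session.timeZone", previous)
}

// The show() action runs inside the block, so the UTC setting is in effect
// when the timestamp string is rendered.
withSessionTimeZone(spark, "UTC") {
  df.withColumn("NEW_DATE", to_timestamp(from_unixtime(col("DATE") / 1000))).show(false)
}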

Upvotes: 2

koiralo

Reputation: 23099

Setting the timezone in the config should fix the issue. Setting spark.sql.session.timeZone to UTC anywhere in the config should work:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_unixtime, to_timestamp}

val spark = SparkSession.builder()
    .appName("test")
    .master("local")
    .config("spark.sql.session.timeZone", "UTC").getOrCreate()

import spark.implicits._

val df = Seq(
  (1580452395095L)
).toDF("DATE")

df.withColumn("NEW_DATE",
  to_timestamp(from_unixtime(col("DATE") / 1000), "yyyy-MM-dd HH:mm:ss"))
    .show(false)

Output:

+-------------+-------------------+
|DATE         |NEW_DATE           |
+-------------+-------------------+
|1580452395095|2020-01-31 06:33:15|
+-------------+-------------------+

Without setting the timeZone, we get

+-------------+-------------------+
|DATE         |NEW_DATE           |
+-------------+-------------------+
|1580452395095|2020-01-31 07:33:15|
+-------------+-------------------+

Upvotes: 3

mck

Reputation: 42332

You can try appending the following line to your $SPARK_HOME/conf/spark-defaults.conf file:

spark.sql.session.timeZone UTC

If necessary, you can also set the JVM timezones using:

spark.driver.extraJavaOptions -Duser.timezone=GMT
spark.executor.extraJavaOptions -Duser.timezone=GMT
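
If you go this route, a quick check at runtime can confirm both settings took effect; this is a sketch that assumes spark is the active SparkSession (for example in spark-shell):

// Verify the JVM default timezone and the Spark session timezone.
println(java.util.TimeZone.getDefault.getID)          // expect GMT after -Duser.timezone=GMT
println(spark.conf.get("spark.sql.session.timeZone")) // expect UTC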

Upvotes: 1
