Timezone problem with spark.sql time functions

Question

I'm writing some code in a Jupyter notebook using Spark 2.4.7 and PySpark running in standalone mode.
I need to convert some timestamps to Unix time to do some operations, but I noticed a strange behavior. This is the code I'm running:

import pyspark
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from datetime import datetime, timedelta, date

spark = SparkSession.builder \
        .appName("test") \
        .master(n_spark_master)\
        .config("spark.total.executor.cores",n_spark_cores_max)\
        .config("spark.cores.max", n_spark_cores_max)\
        .config("spark.executor.memory",n_spark_executor_memory)\
        .config("spark.executor.cores",n_spark_executor_cores)\
        .enableHiveSupport() \
        .getOrCreate()

print(datetime.now().astimezone().tzinfo)

df = spark.createDataFrame([
    (1, "a"),
    (2, "b"),
    (3, "c"), ], ["dummy1", "dummy2"])

epoch = datetime.utcfromtimestamp(0)
df = df.withColumn('epoch', F.lit(epoch))
timeFmt = '%Y-%m-%dT%H:%M:%S'
df = df.withColumn('unix_time_epoch', F.unix_timestamp('epoch', format=timeFmt))
df.show()

output:

CET
+------+------+-------------------+---------------+
|dummy1|dummy2|              epoch|unix_time_epoch|
+------+------+-------------------+---------------+
|     1|     a|1970-01-01 00:00:00|          -3600|
|     2|     b|1970-01-01 00:00:00|          -3600|
|     3|     c|1970-01-01 00:00:00|          -3600|
+------+------+-------------------+---------------+

From the Spark 2.4.7 documentation:

pyspark.sql.functions.unix_timestamp(timestamp=None, format='yyyy-MM-dd HH:mm:ss')
Convert time string with given pattern (‘yyyy-MM-dd HH:mm:ss’, by default) to Unix time stamp (in seconds), using the default timezone and the default locale, return null if fail.

The earlier call print(datetime.now().astimezone().tzinfo), which outputs CET, reports my local timezone, and that is correct for this machine since I'm in UTC+1.
In the Spark UI I can also clearly see user.timezone=Europe/Rome.
Still, it looks like Spark is converting from UTC+1 to UTC, so I get unix_time_epoch = -3600 in the output where I would expect unix_time_epoch = 0.
I've tried switching to UTC, as other threads suggest, like this:

import pyspark
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from datetime import datetime, timedelta, date
import os
import time

os.environ['TZ'] = 'Europe/London'
time.tzset()

spark = SparkSession.builder \
        .appName("test") \
        .master(n_spark_master)\
        .config("spark.total.executor.cores",n_spark_cores_max)\
        .config("spark.cores.max", n_spark_cores_max)\
        .config("spark.executor.memory",n_spark_executor_memory)\
        .config("spark.executor.cores",n_spark_executor_cores)\
        .config('spark.driver.extraJavaOptions', '-Duser.timezone=UTC') \
        .config('spark.executor.extraJavaOptions', '-Duser.timezone=UTC') \
        .config('spark.sql.session.timeZone', 'UTC') \
        .enableHiveSupport() \
        .getOrCreate()

print(datetime.now().astimezone().tzinfo)

df = spark.createDataFrame([
    (1, "a"),
    (2, "b"),
    (3, "c"),
], ["dummy1", "dummy2"])

epoch = datetime.utcfromtimestamp(0)
df = df.withColumn('epoch', F.lit(epoch))
timeFmt = '%Y-%m-%dT%H:%M:%S'
df = df.withColumn('unix_time_epoch',F.unix_timestamp('epoch', format=timeFmt))
df.show()

but the output is:

GMT
+------+------+-------------------+---------------+
|dummy1|dummy2|              epoch|unix_time_epoch|
+------+------+-------------------+---------------+
|     1|     a|1969-12-31 23:00:00|          -3600|
|     2|     b|1969-12-31 23:00:00|          -3600|
|     3|     c|1969-12-31 23:00:00|          -3600|
+------+------+-------------------+---------------+

What I want to achieve is to evaluate everything in UTC, with no timezone offset, since in Rome, where I'm located, the local timezone shifts during the year between UTC+1 and UTC+2. The expected output should look like this:

+------+------+-------------------+---------------+
|dummy1|dummy2|              epoch|unix_time_epoch|
+------+------+-------------------+---------------+
|     1|     a|1970-01-01 00:00:00|              0|
|     2|     b|1970-01-01 00:00:00|              0|
|     3|     c|1970-01-01 00:00:00|              0|
+------+------+-------------------+---------------+

mck · Accepted Answer

You should use os.environ['TZ'] = 'UTC' instead of Europe/London.
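
To see why the TZ value matters, here is a minimal sketch in plain Python (no Spark needed). It assumes a Unix-like system, since time.tzset() is not available on Windows, and it assumes that PySpark converts a naive datetime using the Python process's local timezone, which is what the output in the question suggests. The helper name local_epoch_seconds is made up for this illustration.

```python
import os
import time
from datetime import datetime


def local_epoch_seconds(tz_name, naive_dt):
    """Interpret naive_dt as local time in tz_name and return Unix seconds."""
    old_tz = os.environ.get("TZ")
    os.environ["TZ"] = tz_name
    time.tzset()  # Unix only: make the process re-read TZ
    try:
        return int(time.mktime(naive_dt.timetuple()))
    finally:
        # Restore the previous setting so the rest of the notebook is unaffected.
        if old_tz is None:
            os.environ.pop("TZ", None)
        else:
            os.environ["TZ"] = old_tz
        time.tzset()


epoch = datetime(1970, 1, 1)  # same value as datetime.utcfromtimestamp(0)

for tz in ("UTC", "Europe/London", "Europe/Rome"):
    print(tz, local_epoch_seconds(tz, epoch))
```

On a machine with a complete timezone database this should print 0 for UTC and -3600 for both Europe/London and Europe/Rome, matching the -3600 seen in the question.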

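In 1970, the United Kingdom was running the "British Standard Time experiment", under which the UK stayed on GMT+1 throughout the year between 27 October 1968 and 31 October 1971 (source: wiki). Europe/London was therefore one hour ahead of UTC on 1 January 1970, so midnight local time falls one hour before the Unix epoch. That's why you got a time that is one hour earlier.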
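
The historical offset can be checked against the IANA timezone database. A small sketch, assuming Python 3.9+ (for the zoneinfo module) and timezone data installed on the machine:

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo

london = ZoneInfo("Europe/London")

# Midnight on 1 January 1970 falls inside the British Standard Time period.
offset_1970 = datetime(1970, 1, 1, tzinfo=london).utcoffset()
# A winter date after the experiment ended, for comparison.
offset_1972 = datetime(1972, 1, 1, tzinfo=london).utcoffset()

print("1970-01-01 is one hour ahead of UTC:", offset_1970 == timedelta(hours=1))
print("1972-01-01 is on GMT:", offset_1972 == timedelta(0))
```

Both comparisons should come out True if the installed timezone data includes the 1968 to 1971 period.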
