Reputation: 7611
I am working with PySpark and my input data contains a timestamp column (which includes timezone info) like this:
2012-11-20T17:39:37Z
I want to create the America/New_York representation of this timestamp. My understanding is that the best tool for that is from_utc_timestamp. However, when I use it I get nonsensical results back.
F.from_utc_timestamp(F.col('ts'), 'America/New_York')
>>> datetime.datetime(2012, 11, 20, 7, 39, 37)
when it should be
datetime.datetime(2012, 11, 20, 12, 39, 37)
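For reference, here is a minimal self-contained version of what I'm running (a sketch; the one-row DataFrame is just for illustration, and my session timezone is not UTC):
from pyspark.sql import functions as F
df = spark.createDataFrame([('2012-11-20T17:39:37Z',)], 'ts: string')
df.select(F.from_utc_timestamp(F.col('ts'), 'America/New_York').alias('ny')).show(truncate=False)
# 'ny' comes back shifted by an extra UTC offset instead of 12:39:37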
From from_utc_timestamp's doc I see
This function may return confusing result if the input is a string with timezone, e.g. ‘2018-03-13T06:18:23+00:00’. The reason is that, Spark firstly cast the string to timestamp according to the timezone in the string, and finally display the result by converting the timestamp to string according to the session local timezone.
so I think the culprit is that the timestamp carries tzinfo rather than being naive. But I can't find a good way to strip this information from the timestamp.
Disclaimer: 1) I don't want to rely on UDFs for this; 2) I cannot change the SparkSession timezone, as the cluster is not dedicated to this job alone.
Any ideas?
Upvotes: 8
Views: 37606
Reputation: 17546
For the more complicated case of converting from one non-UTC timezone to another non-UTC timezone, the right approach is to first convert from the source timezone to UTC using to_utc_timestamp() and then convert from UTC to the target timezone using from_utc_timestamp(). This is the most accurate option because it takes care of edge cases such as daylight saving time.
Note the following daylight saving time changes in New York, London and Paris in 2025.
The following code converts a New York (EST or EDT) timestamp to London (GMT or BST) and Paris (CET or CEST).
>>> from datetime import datetime
>>> from pyspark.sql import DataFrame, functions as F
>>>
>>> ts = datetime.fromisoformat  # shorthand for building the sample timestamps
>>> DataFrame.s = lambda self: self.show(truncate=False)  # .s() = show(truncate=False)
>>>
>>> data2 = [
... (ts('2025-01-01 11:00:00'), 'EST -> GMT/CET',),
... (ts('2025-01-01 23:00:00'), 'EST -> GMT/CET',),
... (ts('2025-03-15 11:00:00'), 'EDT -> GMT/CET',),
... (ts('2025-03-15 23:00:00'), 'EDT -> GMT/CET',),
... (ts('2025-05-01 11:00:00'), 'EDT -> BST/CEST',),
... (ts('2025-05-01 23:00:00'), 'EDT -> BST/CEST',),
... ]
>>>
>>> def convert(df, from_tz, to_tz):
...     df = df.withColumn('utc_time', F.to_utc_timestamp('ny_time', from_tz))  # source tz -> UTC
...     df = df.withColumn(to_tz, F.from_utc_timestamp('utc_time', to_tz))  # UTC -> target tz
...     df = df.withColumn('date_diff', F.date_diff(to_tz, 'ny_time'))  # calendar days apart
...     df = df.withColumn('time_diff', F.col(to_tz) - F.col('ny_time'))  # interval between the two
...     df.s()
...
>>>
>>> df = spark.createDataFrame(data=data2, schema='ny_time: timestamp, note: string')
>>>
>>> convert(df, 'America/New_York', 'Europe/London')
+-------------------+---------------+-------------------+-------------------+---------+-----------------------------------+
|ny_time |note |utc_time |Europe/London |date_diff|time_diff |
+-------------------+---------------+-------------------+-------------------+---------+-----------------------------------+
|2025-01-01 11:00:00|EST -> GMT/CET |2025-01-01 16:00:00|2025-01-01 16:00:00|0 |INTERVAL '0 05:00:00' DAY TO SECOND|
|2025-01-01 23:00:00|EST -> GMT/CET |2025-01-02 04:00:00|2025-01-02 04:00:00|1 |INTERVAL '0 05:00:00' DAY TO SECOND|
|2025-03-15 11:00:00|EDT -> GMT/CET |2025-03-15 15:00:00|2025-03-15 15:00:00|0 |INTERVAL '0 04:00:00' DAY TO SECOND|
|2025-03-15 23:00:00|EDT -> GMT/CET |2025-03-16 03:00:00|2025-03-16 03:00:00|1 |INTERVAL '0 04:00:00' DAY TO SECOND|
|2025-05-01 11:00:00|EDT -> BST/CEST|2025-05-01 15:00:00|2025-05-01 16:00:00|0 |INTERVAL '0 05:00:00' DAY TO SECOND|
|2025-05-01 23:00:00|EDT -> BST/CEST|2025-05-02 03:00:00|2025-05-02 04:00:00|1 |INTERVAL '0 05:00:00' DAY TO SECOND|
+-------------------+---------------+-------------------+-------------------+---------+-----------------------------------+
>>>
>>> convert(df, 'America/New_York', 'Europe/Paris')
+-------------------+---------------+-------------------+-------------------+---------+-----------------------------------+
|ny_time |note |utc_time |Europe/Paris |date_diff|time_diff |
+-------------------+---------------+-------------------+-------------------+---------+-----------------------------------+
|2025-01-01 11:00:00|EST -> GMT/CET |2025-01-01 16:00:00|2025-01-01 17:00:00|0 |INTERVAL '0 06:00:00' DAY TO SECOND|
|2025-01-01 23:00:00|EST -> GMT/CET |2025-01-02 04:00:00|2025-01-02 05:00:00|1 |INTERVAL '0 06:00:00' DAY TO SECOND|
|2025-03-15 11:00:00|EDT -> GMT/CET |2025-03-15 15:00:00|2025-03-15 16:00:00|0 |INTERVAL '0 05:00:00' DAY TO SECOND|
|2025-03-15 23:00:00|EDT -> GMT/CET |2025-03-16 03:00:00|2025-03-16 04:00:00|1 |INTERVAL '0 05:00:00' DAY TO SECOND|
|2025-05-01 11:00:00|EDT -> BST/CEST|2025-05-01 15:00:00|2025-05-01 17:00:00|0 |INTERVAL '0 06:00:00' DAY TO SECOND|
|2025-05-01 23:00:00|EDT -> BST/CEST|2025-05-02 03:00:00|2025-05-02 05:00:00|1 |INTERVAL '0 06:00:00' DAY TO SECOND|
+-------------------+---------------+-------------------+-------------------+---------+-----------------------------------+
>>>
Upvotes: 0
Reputation: 992
I tried to find a built-in Spark function for this, but gave up and ended up using a UDF. This was the only way I could find to make the timezone correction while avoiding the problem where the built-in Spark functions return "confusing" (read: incorrect) results if the input is a string with a timezone.
Here's what I did:
from pyspark.sql.functions import udf, col
import pytz

localTime = pytz.timezone("US/Eastern")
utc = pytz.timezone("UTC")

# interpret the naive timestamp as US/Eastern local time, then convert it to UTC
d2b_tzcorrection = udf(lambda x: localTime.localize(x).astimezone(utc), "timestamp")
Let df be a Spark DataFrame with a column named DateTime that contains values that Spark thinks are in UTC time zone when they actually represent a local time zone (US/Eastern in my case).
The time zone can be corrected as follows using the above-defined UDF.
df = df.withColumn("DateTime", d2b_tzcorrection(col("DateTime")))
Some other approaches I started out with did not correctly account for the different offsets that apply during daylight saving time versus standard time, but this approach handles that properly.
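As a quick sanity check of the daylight saving behavior the UDF relies on (a plain-Python sketch with made-up sample timestamps, not part of the original code):
from datetime import datetime
import pytz
localTime = pytz.timezone("US/Eastern")
utc = pytz.timezone("UTC")
winter = datetime(2012, 1, 15, 12, 0, 0)  # EST, UTC-5
summer = datetime(2012, 7, 15, 12, 0, 0)  # EDT, UTC-4
print(localTime.localize(winter).astimezone(utc))  # 2012-01-15 17:00:00+00:00
print(localTime.localize(summer).astimezone(utc))  # 2012-07-15 16:00:00+00:00
The same localize/astimezone logic runs inside the UDF row by row.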
Upvotes: 1
Reputation: 5487
Setting the SparkSession timezone to UTC should give you the required result.
spark.conf.set('spark.sql.session.timeZone', 'UTC')
spark.sql("""select from_utc_timestamp('2012-11-20T17:39:37Z', 'America/New_York') as datetime""" ).show(truncate=False)
'''
+-------------------+
|datetime |
+-------------------+
|2012-11-20 12:39:37|
+-------------------+'''
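The same works through the DataFrame API on the ts column from the question (a sketch; the one-row DataFrame is hypothetical, for illustration only):
from pyspark.sql import functions as F
spark.conf.set('spark.sql.session.timeZone', 'UTC')
df = spark.createDataFrame([('2012-11-20T17:39:37Z',)], 'ts: string')
df.select(F.from_utc_timestamp(F.col('ts'), 'America/New_York').alias('datetime')).show(truncate=False)
'''
+-------------------+
|datetime           |
+-------------------+
|2012-11-20 12:39:37|
+-------------------+'''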
Alternatively, you can set the timezone to America/New_York and use to_timestamp().
spark.conf.set('spark.sql.session.timeZone', 'America/New_York')
spark.sql("""select to_timestamp('2012-11-20T17:39:37Z', "yyyy-MM-dd'T'HH:mm:ssz") as datetime""").show(truncate=False)
'''
+-------------------+
|datetime |
+-------------------+
|2012-11-20 12:39:37|
+-------------------+'''
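If changing the session timezone for other work is a concern (as noted in the question), one option is to save and restore the setting around the conversion (a sketch, not part of the original answer; the setting still changes for the duration of the block, so this only helps when nothing else runs concurrently in the same session):
prev_tz = spark.conf.get('spark.sql.session.timeZone')  # remember the current setting
try:
    spark.conf.set('spark.sql.session.timeZone', 'America/New_York')
    spark.sql("""select to_timestamp('2012-11-20T17:39:37Z', "yyyy-MM-dd'T'HH:mm:ssz") as datetime""").show(truncate=False)
finally:
    spark.conf.set('spark.sql.session.timeZone', prev_tz)  # restore it for other jobs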
Upvotes: 9