Reputation: 7611
I am working with PySpark and my input data contains a timestamp column (which includes timezone info) like this:
2012-11-20T17:39:37Z
I want to create the America/New_York representation of this timestamp. My understanding is that the best tool for that is from_utc_timestamp. However, when I use it I get nonsensical results back.
F.from_utc_timestamp(F.col('ts'), 'America/New_York')
>>> datetime.datetime(2012, 11, 20, 7, 39, 37)
when it should be
datetime.datetime(2012, 11, 20, 12, 39, 37)
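For reference, here is a minimal self-contained version of what I'm running (a sketch; the one-row DataFrame is just for illustration, and my session timezone is not UTC):
from pyspark.sql import functions as F
df = spark.createDataFrame([('2012-11-20T17:39:37Z',)], 'ts: string')
df.select(F.from_utc_timestamp(F.col('ts'), 'America/New_York').alias('ny')).show(truncate=False)
# 'ny' comes back shifted by an extra UTC offset instead of 12:39:37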
From from_utc_timestamp's doc I see
This function may return confusing result if the input is a string with timezone, e.g. ‘2018-03-13T06:18:23+00:00’. The reason is that, Spark firstly cast the string to timestamp according to the timezone in the string, and finally display the result by converting the timestamp to string according to the session local timezone.
so I think the culprit is that the timestamp carries tzinfo rather than being naive. But I can't find a good way to strip this information from the timestamp.
Disclaimer: 1) I don't want to rely on UDFs for this; 2) I cannot change the SparkSession timezone, as the cluster is not dedicated to this job alone.
Any ideas?
Upvotes: 8
Views: 37606
Reputation: 17546
For the more complicated case of converting from one non-UTC timezone to another non-UTC timezone, the right approach is to first convert from the source timezone to UTC using to_utc_timestamp() and then convert from UTC to the target timezone using from_utc_timestamp(). This is the most accurate option because it takes care of edge cases such as daylight saving time.
Note the following daylight saving time changes in New York, London and Paris in 2025.
The following code converts a New York (EST or EDT) timestamp to London (GMT or BST) and Paris (CET or CEST).
>>> from datetime import datetime
>>> from pyspark.sql import DataFrame, functions as F
>>>
>>> ts = datetime.fromisoformat  # shorthand for building the sample timestamps
>>> DataFrame.s = lambda self: self.show(truncate=False)  # .s() = show(truncate=False)
>>>
>>> data2 = [
... (ts('2025-01-01 11:00:00'), 'EST -> GMT/CET',),
... (ts('2025-01-01 23:00:00'), 'EST -> GMT/CET',),
... (ts('2025-03-15 11:00:00'), 'EDT -> GMT/CET',),
... (ts('2025-03-15 23:00:00'), 'EDT -> GMT/CET',),
... (ts('2025-05-01 11:00:00'), 'EDT -> BST/CEST',),
... (ts('2025-05-01 23:00:00'), 'EDT -> BST/CEST',),
... ]
>>>
>>> def convert(df, from_tz, to_tz):
...     df = df.withColumn('utc_time', F.to_utc_timestamp('ny_time', from_tz))  # source tz -> UTC
...     df = df.withColumn(to_tz, F.from_utc_timestamp('utc_time', to_tz))  # UTC -> target tz
...     df = df.withColumn('date_diff', F.date_diff(to_tz, 'ny_time'))  # calendar days apart
...     df = df.withColumn('time_diff', F.col(to_tz) - F.col('ny_time'))  # interval between the two
...     df.s()
...
>>>
>>> df = spark.createDataFrame(data=data2, schema='ny_time: timestamp, note: string')
>>>
>>> convert(df, 'America/New_York', 'Europe/London')
+-------------------+---------------+-------------------+-------------------+---------+-----------------------------------+
|ny_time |note |utc_time |Europe/London |date_diff|time_diff |
+-------------------+---------------+-------------------+-------------------+---------+-----------------------------------+
|2025-01-01 11:00:00|EST -> GMT/CET |2025-01-01 16:00:00|2025-01-01 16:00:00|0 |INTERVAL '0 05:00:00' DAY TO SECOND|
|2025-01-01 23:00:00|EST -> GMT/CET |2025-01-02 04:00:00|2025-01-02 04:00:00|1 |INTERVAL '0 05:00:00' DAY TO SECOND|
|2025-03-15 11:00:00|EDT -> GMT/CET |2025-03-15 15:00:00|2025-03-15 15:00:00|0 |INTERVAL '0 04:00:00' DAY TO SECOND|
|2025-03-15 23:00:00|EDT -> GMT/CET |2025-03-16 03:00:00|2025-03-16 03:00:00|1 |INTERVAL '0 04:00:00' DAY TO SECOND|
|2025-05-01 11:00:00|EDT -> BST/CEST|2025-05-01 15:00:00|2025-05-01 16:00:00|0 |INTERVAL '0 05:00:00' DAY TO SECOND|
|2025-05-01 23:00:00|EDT -> BST/CEST|2025-05-02 03:00:00|2025-05-02 04:00:00|1 |INTERVAL '0 05:00:00' DAY TO SECOND|
+-------------------+---------------+-------------------+-------------------+---------+-----------------------------------+
>>>
>>> convert(df, 'America/New_York', 'Europe/Paris')
+-------------------+---------------+-------------------+-------------------+---------+-----------------------------------+
|ny_time |note |utc_time |Europe/Paris |date_diff|time_diff |
+-------------------+---------------+-------------------+-------------------+---------+-----------------------------------+
|2025-01-01 11:00:00|EST -> GMT/CET |2025-01-01 16:00:00|2025-01-01 17:00:00|0 |INTERVAL '0 06:00:00' DAY TO SECOND|
|2025-01-01 23:00:00|EST -> GMT/CET |2025-01-02 04:00:00|2025-01-02 05:00:00|1 |INTERVAL '0 06:00:00' DAY TO SECOND|
|2025-03-15 11:00:00|EDT -> GMT/CET |2025-03-15 15:00:00|2025-03-15 16:00:00|0 |INTERVAL '0 05:00:00' DAY TO SECOND|
|2025-03-15 23:00:00|EDT -> GMT/CET |2025-03-16 03:00:00|2025-03-16 04:00:00|1 |INTERVAL '0 05:00:00' DAY TO SECOND|
|2025-05-01 11:00:00|EDT -> BST/CEST|2025-05-01 15:00:00|2025-05-01 17:00:00|0 |INTERVAL '0 06:00:00' DAY TO SECOND|
|2025-05-01 23:00:00|EDT -> BST/CEST|2025-05-02 03:00:00|2025-05-02 05:00:00|1 |INTERVAL '0 06:00:00' DAY TO SECOND|
+-------------------+---------------+-------------------+-------------------+---------+-----------------------------------+
>>>
Upvotes: 0
Reputation: 992
I tried to find a built-in Spark function for this, but gave up and ended up using a UDF. This was the only way I could find to make the timezone correction while avoiding the problem where the built-in Spark functions return "confusing" (read: incorrect) results if the input is a string with a timezone.
Here's what I did:
from pyspark.sql.functions import udf, col
import pytz

localTime = pytz.timezone("US/Eastern")
utc = pytz.timezone("UTC")

# interpret the naive timestamp as US/Eastern local time, then convert it to UTC
d2b_tzcorrection = udf(lambda x: localTime.localize(x).astimezone(utc), "timestamp")
Let df be a Spark DataFrame with a column named DateTime that contains values that Spark thinks are in UTC time zone when they actually represent a local time zone (US/Eastern in my case).
The time zone can be corrected as follows using the above-defined UDF.
df = df.withColumn("DateTime", d2b_tzcorrection(col("DateTime")))
Some other approaches I started out with did not correctly account for the different offsets that apply during daylight saving time versus standard time, but this approach handles that properly.
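As a quick sanity check of the daylight saving behavior the UDF relies on (a plain-Python sketch with made-up sample timestamps, not part of the original code):
from datetime import datetime
import pytz
localTime = pytz.timezone("US/Eastern")
utc = pytz.timezone("UTC")
winter = datetime(2012, 1, 15, 12, 0, 0)  # EST, UTC-5
summer = datetime(2012, 7, 15, 12, 0, 0)  # EDT, UTC-4
print(localTime.localize(winter).astimezone(utc))  # 2012-01-15 17:00:00+00:00
print(localTime.localize(summer).astimezone(utc))  # 2012-07-15 16:00:00+00:00
The same localize/astimezone logic runs inside the UDF row by row.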
Upvotes: 1
Reputation: 5487
Setting the SparkSession timezone to UTC should give you the required result.
spark.conf.set('spark.sql.session.timeZone', 'UTC')
spark.sql("""select from_utc_timestamp('2012-11-20T17:39:37Z', 'America/New_York') as datetime""" ).show(truncate=False)
'''
+-------------------+
|datetime |
+-------------------+
|2012-11-20 12:39:37|
+-------------------+'''
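The same works through the DataFrame API on the ts column from the question (a sketch; the one-row DataFrame is hypothetical, for illustration only):
from pyspark.sql import functions as F
spark.conf.set('spark.sql.session.timeZone', 'UTC')
df = spark.createDataFrame([('2012-11-20T17:39:37Z',)], 'ts: string')
df.select(F.from_utc_timestamp(F.col('ts'), 'America/New_York').alias('datetime')).show(truncate=False)
'''
+-------------------+
|datetime           |
+-------------------+
|2012-11-20 12:39:37|
+-------------------+'''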
Alternatively, you can set the timezone to America/New_York and use to_timestamp().
spark.conf.set('spark.sql.session.timeZone', 'America/New_York')
spark.sql("""select to_timestamp('2012-11-20T17:39:37Z', "yyyy-MM-dd'T'HH:mm:ssz") as datetime""").show(truncate=False)
'''
+-------------------+
|datetime |
+-------------------+
|2012-11-20 12:39:37|
+-------------------+'''
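If changing the session timezone for other work is a concern (as noted in the question), one option is to save and restore the setting around the conversion (a sketch, not part of the original answer; the setting still changes for the duration of the block, so this only helps when nothing else runs concurrently in the same session):
prev_tz = spark.conf.get('spark.sql.session.timeZone')  # remember the current setting
try:
    spark.conf.set('spark.sql.session.timeZone', 'America/New_York')
    spark.sql("""select to_timestamp('2012-11-20T17:39:37Z', "yyyy-MM-dd'T'HH:mm:ssz") as datetime""").show(truncate=False)
finally:
    spark.conf.set('spark.sql.session.timeZone', prev_tz)  # restore it for other jobs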
Upvotes: 9