Reputation: 7581
I am working with PySpark and my input data contains a timestamp column (with timezone info) like this:
2012-11-20T17:39:37Z
I want to create the America/New_York representation of this timestamp. My understanding is that the best tool for that is from_utc_timestamp, but when I use it I get nonsensical results back.
F.from_utc_timestamp(F.col('ts'), 'America/New_York')
>>> datetime.datetime(2012, 11, 20, 7, 39, 37)
when it should be
datetime.datetime(2012, 11, 20, 12, 39, 37)
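For reference, here is a minimal, self-contained sketch of the setup (assuming a one-column DataFrame with the string in a column named ts; the exact output depends on the session's spark.sql.session.timeZone):
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Single-row DataFrame holding the example timestamp string
df = spark.createDataFrame([("2012-11-20T17:39:37Z",)], ["ts"])

# Spark first casts the string (honouring the trailing Z), then shifts it,
# so what gets displayed depends on spark.sql.session.timeZone
df.select(F.from_utc_timestamp(F.col("ts"), "America/New_York")).show(truncate=False)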
From from_utc_timestamp's doc I see
This function may return confusing result if the input is a string with timezone, e.g. ‘2018-03-13T06:18:23+00:00’. The reason is that, Spark firstly cast the string to timestamp according to the timezone in the string, and finally display the result by converting the timestamp to string according to the session local timezone.
So I think the culprit is that the timestamp carries tzinfo rather than being naive, but I can't find a good way to strip this information from the timestamp.
Disclaimer: (1) I don't want to rely on UDFs for this; (2) I cannot change the SparkSession timezone, as this cluster is not dedicated to this job only.
Any ideas?
Upvotes: 7
Views: 36430
Reputation: 982
I tried to find a built-in Spark function for this, but gave up and ended up using a UDF. This was the only way I could find to make the time-zone correction while avoiding the problem where the built-in Spark functions return "confusing results" (which would more accurately read "incorrect results") when the input is a string with a timezone.
Here's what I did:
from pyspark.sql.functions import udf, col
import pytz

# Source and target time zones
localTime = pytz.timezone("US/Eastern")
utc = pytz.timezone("UTC")

# Attach the local zone to the naive timestamp, then convert it to UTC
d2b_tzcorrection = udf(lambda x: localTime.localize(x).astimezone(utc), "timestamp")
Let df be a Spark DataFrame with a column named DateTime that contains values that Spark thinks are in UTC time zone when they actually represent a local time zone (US/Eastern in my case).
The time zone can be corrected as follows using the above-defined UDF.
df = df.withColumn("DateTime", d2b_tzcorrection(col("DateTime")))
Some other approaches I started out with did not correctly account for the different offsets that apply during daylight saving time versus standard time, but this approach handles that properly.
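To illustrate the DST handling (a small sketch using the same pytz objects, not part of the correction code itself): the same wall-clock time maps to different UTC instants depending on whether US/Eastern is on standard time (UTC-5) or daylight saving time (UTC-4), and localize/astimezone picks the right offset per date.
from datetime import datetime
import pytz

localTime = pytz.timezone("US/Eastern")
utc = pytz.timezone("UTC")

# January is EST (UTC-5): noon local becomes 17:00 UTC
print(localTime.localize(datetime(2012, 1, 15, 12, 0)).astimezone(utc))
# July is EDT (UTC-4): noon local becomes 16:00 UTC
print(localTime.localize(datetime(2012, 7, 15, 12, 0)).astimezone(utc))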
Upvotes: 1
Reputation: 5487
Pointing the SparkSession timezone to UTC should give you the required result.
spark.conf.set('spark.sql.session.timeZone', 'UTC')
spark.sql("""select from_utc_timestamp('2012-11-20T17:39:37Z', 'America/New_York') as datetime""" ).show(truncate=False)
'''
+-------------------+
|datetime |
+-------------------+
|2012-11-20 12:39:37|
+-------------------+'''
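For completeness, a DataFrame-API sketch of the same query (assuming a string column named ts; not part of the original snippet):
import pyspark.sql.functions as F

spark.conf.set('spark.sql.session.timeZone', 'UTC')
df = spark.createDataFrame([('2012-11-20T17:39:37Z',)], ['ts'])

# The string column is cast to timestamp, then shifted to America/New_York
df.select(F.from_utc_timestamp('ts', 'America/New_York').alias('datetime')).show(truncate=False)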
OR, you can set the timezone to America/New_York and use to_timestamp().
spark.conf.set('spark.sql.session.timeZone', 'America/New_York')
spark.sql("""select to_timestamp('2012-11-20T17:39:37Z', "yyyy-MM-dd'T'HH:mm:ssz") as datetime""").show(truncate=False)
'''
+-------------------+
|datetime |
+-------------------+
|2012-11-20 12:39:37|
+-------------------+'''
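Since the question mentions a shared cluster, note that spark.conf.set changes the config only for the current SparkSession, not cluster-wide. If even that is a concern, one option (a sketch, not part of the answer above) is to read the current value, switch it for the duration of the query, and restore it afterwards:
# Save the current session timezone, switch temporarily, then restore it
original_tz = spark.conf.get('spark.sql.session.timeZone')
try:
    spark.conf.set('spark.sql.session.timeZone', 'UTC')
    spark.sql("""select from_utc_timestamp('2012-11-20T17:39:37Z', 'America/New_York') as datetime""").show(truncate=False)
finally:
    spark.conf.set('spark.sql.session.timeZone', original_tz)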
Upvotes: 9