user1450410

Reputation: 301

pyspark unix_timestamp sometimes returns null

For certain strings, unix_timestamp returns null:

from pyspark.sql import functions as F

date_format = "yyyy-MM-dd'T'HH:mm:ss'Z'"
raw_data.select(F.unix_timestamp(F.lit("2019-03-10T02:56:36Z"), format=date_format)).show(1)
+--------------------------------------------------------------+
|unix_timestamp(2019-03-10T02:56:36Z, yyyy-MM-dd'T'HH:mm:ss'Z')|
+--------------------------------------------------------------+
|                                                          null|
+--------------------------------------------------------------+
only showing top 1 row

But when I first convert the problematic string to a UTC timestamp, it works:

raw_data.select(F.unix_timestamp(F.to_utc_timestamp(F.lit("2019-03-10T02:56:36Z"), "America/New_York"))).show(1)
+---------------------------------------------------------------------------------------------+
|unix_timestamp(to_utc_timestamp(2019-03-10T02:56:36Z, America/New_York), yyyy-MM-dd HH:mm:ss)|
+---------------------------------------------------------------------------------------------+
|                                                                                   1552204596|
+---------------------------------------------------------------------------------------------+
only showing top 1 row

Is this a problem with converting the string using unix_timestamp? How can I avoid converting to UTC?

Upvotes: 1

Views: 816

Answers (1)

notNull

Reputation: 31490

The issue is with the timestamp 2019-03-10T02:56:36Z, because daylight saving time starts on March 10 at 2 AM.

  • When daylight saving time begins, the clock is set forward to 3 AM, skipping the 2 AM hour entirely. There is no 02:56 local time on that date, which is why the result is null (the sketch after this list illustrates the gap).

  • When you use to_utc_timestamp, Spark interprets the timestamp as America/New_York local time and shifts it to UTC by adding the 5-hour EST offset, so the time becomes 07:56, which is a valid time and yields a unix_time instead of null.

  • To fix the issue, use regexp_replace inside unix_timestamp to replace 2019-03-10T02 with 2019-03-10T03.
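
A quick way to see the gap itself, as a sketch (the expected values in the comments assume a standard tz database): two America/New_York local times one clock-hour apart map to UTC instants only one second apart, because 02:00-02:59 never occurs on that date.

#the local clock jumps from 01:59:59 EST straight to 03:00:00 EDT
spark.sql("""select to_utc_timestamp('2019-03-10 01:59:59', 'America/New_York') before_gap,
                    to_utc_timestamp('2019-03-10 03:00:00', 'America/New_York') after_gap""").show()
#expected: before_gap=2019-03-10 06:59:59, after_gap=2019-03-10 07:00:00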

Example:

#invalid time
spark.sql("""select unix_timestamp('2019-03-10T02:56:36Z', "yyyy-MM-dd'T'HH:mm:ss'Z'") unix_time""").show()
#+---------+
#|unix_time|
#+---------+
#|     null|
#+---------+

#valid time
spark.sql("""select unix_timestamp('2019-03-10T03:56:36Z', "yyyy-MM-dd'T'HH:mm:ss'Z'") unix_time""").show()
#+----------+
#| unix_time|
#+----------+
#|1552208196|
#+----------+

#to fix the issue use regexp_replace to replace 2019-03-10T02 with 2019-03-10T03
spark.sql("""select unix_timestamp(regexp_replace('2019-03-10T02:56:36Z','2019-03-10T02','2019-03-10T03'), "yyyy-MM-dd'T'HH:mm:ss'Z'") unix_time""").show()
#+----------+
#| unix_time|
#+----------+
#|1552208196|
#+----------+
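
As an alternative to the hardcoded regexp_replace, a sketch assuming Spark 2.2+ (where the spark.sql.session.timeZone config is available): set the session time zone to UTC, which has no DST gap, so the string parses directly. Since the trailing Z denotes UTC anyway, this arguably gives the more faithful epoch value.

#alternative sketch: parse in UTC, where 02:56:36 is a valid wall-clock time
spark.conf.set("spark.sql.session.timeZone", "UTC")
spark.sql("""select unix_timestamp('2019-03-10T02:56:36Z', "yyyy-MM-dd'T'HH:mm:ss'Z'") unix_time""").show()
#expected: 1552186596 (i.e. 2019-03-10 02:56:36 UTC)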

Upvotes: 1
