Nik

Reputation: 5725

Spark unable to parse timestamp fields

I am creating a Spark (2.2.0) DataFrame by loading from a local file. The file loads as expected and I get the following DataFrame.

scala> df.show(4, false)
+--------+------------------+------------+----------------------------+
|userId  |legacyProductId   |optInFlag   |transaction_date            |
+--------+------------------+------------+----------------------------+
|71844441|805934            |null        |Sat Oct 15 23:35:22 UTC 2005|
|71844441|714837            |null        |Sat Apr 09 10:04:30 UTC 2005|
|71844441|732860            |null        |Sat Mar 19 17:30:26 UTC 2005|
|71844441|1170951           |null        |Sat Mar 19 17:30:26 UTC 2005|
+--------+------------------+------------+----------------------------+
only showing top 4 rows

The first two columns are integers, the last two are strings. I want to convert the transaction_date column to a unix timestamp, so I did the following.

val newdf = df.select($"userId", $"legacyProductId", $"OptInFlag", unix_timestamp($"transaction_date", "EEE MMM dd hh:mm:ss z yyyy"))

With that, I do get the last column as seconds since the epoch (unix_timestamp returns seconds, not milliseconds). However, the conversion does not succeed for all the rows, as shown below.

scala> newdf.show(4, false)
+--------+------------------+------------+------------------------------------------------------------+
|userId  |legacyProductId   |OptInFlag   |unix_timestamp(transaction_date, EEE MMM dd hh:mm:ss z yyyy)|
+--------+------------------+------------+------------------------------------------------------------+
|71844441|805934            |null        |null                                                        |
|71844441|714837            |null        |1113041070                                                  |
|71844441|732860            |null        |null                                                        |
|71844441|1170951           |null        |null                                                        |
+--------+------------------+------------+------------------------------------------------------------+
only showing top 4 rows

Only the second row timestamp gets converted successfully. The rest fail and are set to null.

Am I specifying the format string EEE MMM dd hh:mm:ss z yyyy correctly? How can I debug this?

Upvotes: 2

Views: 1413

Answers (1)

user9964676

Reputation: 166

That's because hh is

Hour in am/pm (1-12)

You should use HH instead:

Hour in day (0-23)

That is also why only the second row converted: its hour, 10, fits the 1-12 range that hh accepts, while 17 and 23 do not. For example:

scala> spark.sql("SELECT unix_timestamp('Sat Mar 19 17:30:26 UTC 2005', 'EEE MMM dd HH:mm:ss zzz yyyy')").show
// +--------------------------------------------------------------------------+
// |unix_timestamp(Sat Mar 19 17:30:26 UTC 2005, EEE MMM dd HH:mm:ss zzz yyyy)|
// +--------------------------------------------------------------------------+
// |                                                                1111253426|
// +--------------------------------------------------------------------------+
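The same fix works in the DataFrame API. A minimal sketch, assuming the df and column names from the question (the transaction_ts alias is just illustrative):

import org.apache.spark.sql.functions.unix_timestamp

// Assumes spark.implicits._ is in scope for the $ syntax, as in the question.
// HH parses the 24-hour clock, so hours like 17 and 23 no longer come back null.
val newdf = df.select(
  $"userId",
  $"legacyProductId",
  $"optInFlag",
  unix_timestamp($"transaction_date", "EEE MMM dd HH:mm:ss zzz yyyy").as("transaction_ts")
)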

Upvotes: 2
