Reputation: 5725
I am creating a Spark (2.2.0) DataFrame by loading from a local file. The file load happens as expected and I get a following DF.
scala> df.show(4, false)
+--------+------------------+------------+----------------------------+
|userId |legacyProductId |optInFlag |transaction_date |
+--------+------------------+------------+----------------------------+
|71844441|805934 |null |Sat Oct 15 23:35:22 UTC 2005|
|71844441|714837 |null |Sat Apr 09 10:04:30 UTC 2005|
|71844441|732860 |null |Sat Mar 19 17:30:26 UTC 2005|
|71844441|1170951 |null |Sat Mar 19 17:30:26 UTC 2005|
+--------+------------------+------------+----------------------------+
only showing top 4 rows
The first two columns are integer
s, the last two are string
s. I want to convert the transaction_date
column to unix timestamp. I did the following.
val newdf = df.select($"userId", $"legacyProductId", $"OptInFlag", unix_timestamp($"transaction_date", "EEE MMM dd hh:mm:ss z yyyy"))
With that, I do get the last column as milliseconds. However, the conversion does not happen for all the rows, as shown below.
scala> newdf.show(4, false)
+--------+------------------+------------+------------------------------------------------------------+
|userId |legacyProductId |OptInFlag |unix_timestamp(transaction_date, EEE MMM dd hh:mm:ss z yyyy)|
+--------+------------------+------------+------------------------------------------------------------+
|71844441|805934 |null |null |
|71844441|714837 |null |1113041070 |
|71844441|732860 |null |null |
|71844441|1170951 |null |null |
+--------+------------------+------------+------------------------------------------------------------+
only showing top 4 rows
Only the second row timestamp gets converted successfully. The rest fail and are set to null.
Am I specifying the format string EEE MMM d hh:mm:ss z yyyy
correctly? How can I debug this?
Upvotes: 2
Views: 1413
Reputation: 166
That's because hh
is
Hour in am/pm (1-12)
You should use HH
:
Hour in day (0-23)
like
scala> spark.sql("SELECT unix_timestamp('Sat Mar 19 17:30:26 UTC 2005', 'EEE MMM dd HH:mm:ss zzz yyyy')").show
// +--------------------------------------------------------------------------+
// |unix_timestamp(Sat Mar 19 17:30:26 UTC 2005, EEE MMM dd HH:mm:ss zzz yyyy)|
// +--------------------------------------------------------------------------+
// | 1111253426|
// +--------------------------------------------------------------------------+
Upvotes: 2