Reputation: 2112
I have a dataframe like so:
+-------+-------------------+
|id |scandatetime |
+-------+-------------------+
|1234567|2020-03-13 10:56:18|
|1234567|2020-03-12 17:09:48|
|1234567|2020-03-12 15:42:25|
|1234567|2020-03-09 16:30:22|
|1234567|2020-03-12 17:09:48|
|1234567|2020-03-09 16:30:22|
|1234567|2020-03-12 15:42:25|
+-------+-------------------+
And I would like to calculate the minimum and maximum timestamps for this id. To do so, I have used the following code:
dfScans = datasource1.toDF()
dfScans = dfScans.withColumn('scandatetime',f.unix_timestamp(f.col('scandatetime'), "yyyy-MM-dd hh:mm:ss").cast("timestamp"))
dfDateAgg = dfScans.groupBy("id").agg(f.min('scandatetime').alias('FirstScanDate'),
f.max('scandatetime').alias('LastScanDate'))
But I am getting the following result:
+-------+-------------------+-------------------+
|id |FirstScanDate |LastScanDate |
+-------+-------------------+-------------------+
|1234567|2020-03-13 10:56:18|2020-03-13 10:56:18|
+-------+-------------------+-------------------+
Why is the min function not returning the right value?
Upvotes: 1
Views: 2898
Reputation: 43544
Your timestamps have hours in the 0-23 range, so you are using the wrong date format. You should be using "yyyy-MM-dd HH:mm:ss" (capital H; see the docs for SimpleDateFormat). The lowercase h refers to hours in the 1-12 range, so every value except "2020-03-13 10:56:18" becomes null upon conversion to timestamp, and min and max ignore nulls, which is why both aggregates return that single remaining value.
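To see the difference in isolation, here is a minimal sketch (the two-row DataFrame is made up for illustration, and it assumes Spark 2.x legacy parsing, where unparseable strings come back as null; Spark 3.x may raise an error instead unless spark.sql.legacy.timeParserPolicy is set to LEGACY):
from pyspark.sql import SparkSession, functions as f

spark = SparkSession.builder.getOrCreate()

demo = spark.createDataFrame(
    [("2020-03-13 10:56:18",), ("2020-03-09 16:30:22",)],
    ["scandatetime"]
)

demo.select(
    "scandatetime",
    # lowercase hh only matches hours 01-12, so "16:30:22" cannot be parsed
    f.unix_timestamp("scandatetime", "yyyy-MM-dd hh:mm:ss")
        .cast("timestamp").alias("lowercase_hh"),
    # uppercase HH matches hours 00-23 and parses both rows
    f.unix_timestamp("scandatetime", "yyyy-MM-dd HH:mm:ss")
        .cast("timestamp").alias("uppercase_HH")
).show()
# the lowercase_hh column is null for the 16:30:22 row, while uppercase_HH parses both
With the corrected format, the original conversion and aggregation work as expected: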
from pyspark.sql import functions as f
dfScans = dfScans.withColumn(
'scandatetime',
f.unix_timestamp(
f.col('scandatetime'),
"yyyy-MM-dd HH:mm:ss"
).cast("timestamp")
)
dfScans.groupBy("id").agg(f.min('scandatetime').alias('FirstScanDate'),
f.max('scandatetime').alias('LastScanDate')).show()
#+-------+-------------------+-------------------+
#| id| FirstScanDate| LastScanDate|
#+-------+-------------------+-------------------+
#|1234567|2020-03-09 16:30:22|2020-03-13 10:56:18|
#+-------+-------------------+-------------------+
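If you prefer, to_timestamp (available since Spark 2.2) does the parse and the cast in one step; a sketch of the same fix:
from pyspark.sql import functions as f

dfScans = dfScans.withColumn(
    'scandatetime',
    # to_timestamp parses straight to a timestamp column, no unix_timestamp/cast round trip
    f.to_timestamp(f.col('scandatetime'), "yyyy-MM-dd HH:mm:ss")
)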
Upvotes: 3