DBA108642

Reputation: 2112

PySpark using Min function on timestamp selecting the wrong value

I have a dataframe like so:

+-------+-------------------+
|id     |scandatetime       |
+-------+-------------------+
|1234567|2020-03-13 10:56:18|
|1234567|2020-03-12 17:09:48|
|1234567|2020-03-12 15:42:25|
|1234567|2020-03-09 16:30:22|
|1234567|2020-03-12 17:09:48|
|1234567|2020-03-09 16:30:22|
|1234567|2020-03-12 15:42:25|
+-------+-------------------+

And I would like to calculate the minimum and maximum timestamps for this id. To do so, I have used the following code:

dfScans = datasource1.toDF()
dfScans = dfScans.withColumn('scandatetime',f.unix_timestamp(f.col('scandatetime'), "yyyy-MM-dd hh:mm:ss").cast("timestamp"))


dfDateAgg = dfScans.groupBy("id").agg(f.min('scandatetime').alias('FirstScanDate'),
    f.max('scandatetime').alias('LastScanDate'))

But I am getting the following return:

+-------+-------------------+-------------------+
|id     |FirstScanDate      |LastScanDate       |
+-------+-------------------+-------------------+
|1234567|2020-03-13 10:56:18|2020-03-13 10:56:18|
+-------+-------------------+-------------------+

Why is the min function not returning the right value?

Upvotes: 1

Views: 2898

Answers (1)

pault

Reputation: 43544

Your timestamps have hours in the 0-23 range, so you are using the wrong date format. You should use "yyyy-MM-dd HH:mm:ss" (capital H); see the docs for SimpleDateFormat.

The lowercase h refers to hours in the 1-12 range, and thus all values except "2020-03-13 10:56:18" become null upon conversion to timestamp. Since min and max ignore nulls, both aggregations return that single non-null value.
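You can check this directly by parsing the same column with both patterns (a quick sketch, assuming dfScans still holds the raw string column from the question):

from pyspark.sql import functions as f

# Parse with both patterns side by side; the lowercase-h version
# yields null for any row whose hour falls outside 1-12.
dfScans.select(
    'scandatetime',
    f.unix_timestamp(f.col('scandatetime'), "yyyy-MM-dd hh:mm:ss").cast("timestamp").alias('parsed_hh'),
    f.unix_timestamp(f.col('scandatetime'), "yyyy-MM-dd HH:mm:ss").cast("timestamp").alias('parsed_HH')
).show()

With the capital-H pattern, the conversion and aggregation work as expected: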

from pyspark.sql import functions as f

dfScans = dfScans.withColumn(
    'scandatetime',
    f.unix_timestamp(
        f.col('scandatetime'), 
        "yyyy-MM-dd HH:mm:ss"
    ).cast("timestamp")
)

dfScans.groupBy("id").agg(f.min('scandatetime').alias('FirstScanDate'),
    f.max('scandatetime').alias('LastScanDate')).show()

#+-------+-------------------+-------------------+
#|     id|      FirstScanDate|       LastScanDate|
#+-------+-------------------+-------------------+
#|1234567|2020-03-09 16:30:22|2020-03-13 10:56:18|
#+-------+-------------------+-------------------+
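As an aside, strings already in the default yyyy-MM-dd HH:mm:ss layout can usually be cast to timestamp without an explicit pattern (a sketch of a simpler alternative, not the code above):

# A plain cast works when the string already matches Spark's
# default timestamp layout (yyyy-MM-dd HH:mm:ss).
dfScans = dfScans.withColumn('scandatetime', f.col('scandatetime').cast('timestamp'))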

Upvotes: 3
