Reputation: 1117
I have the below spark dataframe/dataset.
Date Name Timestamp
2021-08-01 xxx 2021-08-01 07:10:00
2021-08-01 xxx 2021-08-01 10:10:00
2021-08-01 yyy 2021-08-01 07:10:00
2021-08-01 yyy 2021-08-01 10:10:00
2021-08-02 xxx 2021-08-02 07:15:00
2021-08-02 yyy 2021-08-02 07:15:00
In the above dataframe, for the same Date and Name, if there is more than one record, I have to sort by Timestamp descending and retain only the first row, dropping the rest for that Date and Name. I am not sure whether ordering by Timestamp descending and then calling dropDuplicates() is guaranteed to retain the first record and discard the rest.
Is there a way to achieve this in PySpark?
Expected output is below.
Date Name Timestamp
2021-08-01 xxx 2021-08-01 10:10:00
2021-08-01 yyy 2021-08-01 10:10:00
2021-08-02 xxx 2021-08-02 07:15:00
2021-08-02 yyy 2021-08-02 07:15:00
Upvotes: 0
Views: 1271
Reputation: 7207
A window function with row_number() can be used (Scala):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{desc, row_number}
import spark.implicits._

val df = Seq(
  ("2021-08-01", "xxx", "2021-08-01 07:10:00"),
  ("2021-08-01", "xxx", "2021-08-01 10:10:00"),
  ("2021-08-01", "yyy", "2021-08-01 07:10:00"),
  ("2021-08-01", "yyy", "2021-08-01 10:10:00"),
  ("2021-08-02", "xxx", "2021-08-02 07:15:00"),
  ("2021-08-02", "yyy", "2021-08-02 07:15:00")
).toDF("Date", "Name", "Timestamp")

// Number the rows within each (Date, Name) group, newest Timestamp first
val window = Window.partitionBy("Date", "Name").orderBy(desc("Timestamp"))

df
  .withColumn("rownum", row_number().over(window))
  .where($"rownum" === 1) // keep only the newest row per group
  .drop("rownum")
Upvotes: 1