Padfoot123
Padfoot123

Reputation: 1117

Spark - Order by Timestamp descending and drop if more than one timestamp is available for a date in dataframe

I have the below spark dataframe/dataset.

Date         Name    Timestamp
2021-08-01   xxx     2021-08-01 07:10:00
2021-08-01   xxx     2021-08-01 10:10:00
2021-08-01   yyy     2021-08-01 07:10:00
2021-08-01   yyy     2021-08-01 10:10:00
2021-08-02   xxx     2021-08-02 07:15:00
2021-08-02   yyy     2021-08-02 07:15:00

In the above dataframe, for same set of date and name if I have more than 1 record, I have to sort the timestamp descending and retain only the first row and drop the rest of rows for the date and name. I am not sure if order by descending and dropDuplicates() would retain the first record and discard the rest.

Is there a way to achieve this in pyspark.

Expected output is below.

Date         Name    Timestamp
2021-08-01   xxx     2021-08-01 10:10:00
2021-08-01   yyy     2021-08-01 10:10:00
2021-08-02   xxx     2021-08-02 07:15:00
2021-08-02   yyy     2021-08-02 07:15:00

Upvotes: 0

Views: 1271

Answers (1)

pasha701
pasha701

Reputation: 7207

Window function with "row_number()" can be used:

 val df = Seq(
  ("2021-08-01", "xxx", "2021-08-01 07:10:00"),
  ("2021-08-01", "xxx", "2021-08-01 10:10:00"),
  ("2021-08-01", "yyy", "2021-08-01 07:10:00"),
  ("2021-08-01", "yyy", "2021-08-01 10:10:00"),
  ("2021-08-02", "xxx", "2021-08-02 07:15:00"),
  ("2021-08-02", "yyy", "2021-08-02 07:15:00")
).toDF("Date", "Name", "Timestamp")

val window = Window.partitionBy("Date", "Name").orderBy(desc("Timestamp"))
df
  .withColumn("rownum", row_number().over(window))
  .where($"rownum" === 1)
  .drop("rownum")

Upvotes: 1

Related Questions