Galuoises

Reputation: 3283

Spark: filter in each group

I have a dataframe like

+------+-------------------+------+
|group |               time| label|
+------+-------------------+------+
|     a|2020-01-01 10:49:00|first |
|     a|2020-01-01 10:51:00|second|
|     a|2020-01-01 12:49:00|first |
|     b|2020-01-01 12:44:00|second|
|     b|2020-01-01 12:46:00|first |
|     c|2020-01-01 12:46:00|third |
+------+-------------------+------+
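For reproducibility, the sample data could be built along these lines (just a sketch, assuming a spark-shell style session where spark.implicits._ is available):

import spark.implicits._
import org.apache.spark.sql.functions.{col, to_timestamp}

// sample data matching the table above
val df = Seq(
  ("a", "2020-01-01 10:49:00", "first"),
  ("a", "2020-01-01 10:51:00", "second"),
  ("a", "2020-01-01 12:49:00", "first"),
  ("b", "2020-01-01 12:44:00", "second"),
  ("b", "2020-01-01 12:46:00", "first"),
  ("c", "2020-01-01 12:46:00", "third")
).toDF("group", "time", "label")
  .withColumn("time", to_timestamp(col("time")))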

I would like to drop every row labelled first that, within its group, is more recent than a row labelled second or third. For instance, in group a the row with first and 2020-01-01 12:49:00 should be dropped because there is an older row labelled second.

The desired output would be:

+------+-------------------+------+
|group |               time| label|
+------+-------------------+------+
|     a|2020-01-01 10:49:00|first |
|     a|2020-01-01 10:51:00|second|
|     b|2020-01-01 12:44:00|second|
|     c|2020-01-01 12:46:00|third |
+------+-------------------+------+

A window function partitioned by group would let me work within each group, but how do I implement the filter on the label?
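So far I only have the window itself, something like this sketch; the condition on label is the part I'm missing:

import org.apache.spark.sql.expressions.Window

// per-group window ordered by time; the filter on label is the open question
val byGroup = Window.partitionBy("group").orderBy("time")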

Upvotes: 1

Views: 258

Answers (1)

mck

Reputation: 42392

You can compute, for each row, the most recent time seen so far in the group whose label is not "first", and filter using that column:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, last, when}

val df2 = df.withColumn(
    "non_first_time",
    // latest time so far in the group whose label is not "first"
    // (last with ignoreNulls = true over the default running frame)
    last(
        when(col("label") =!= "first", col("time")),
        true
    ).over(
        Window.partitionBy("group").orderBy("time")
    )
).filter("""
    label != 'first' or
    (label = 'first' and (non_first_time > time or non_first_time is null))
""").drop("non_first_time")

df2.show
+-----+-------------------+------+
|group|               time| label|
+-----+-------------------+------+
|    c|2020-01-01 12:46:00| third|
|    b|2020-01-01 12:44:00|second|
|    a|2020-01-01 10:49:00| first|
|    a|2020-01-01 10:51:00|second|
+-----+-------------------+------+
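A note on how this works: last(..., true) ignores nulls and runs over the default frame (unbounded preceding up to the current row), so non_first_time holds the latest time of a second/third row seen so far within the group. The filter then keeps every non-first row, and keeps a first row only when no such time exists yet (non_first_time is null) or it is later than the row's own time.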

Upvotes: 1
