Alf

Reputation: 117

Select rows in a DataFrame where an ID must satisfy two conditions that appear in different rows in PySpark

I have a DataFrame that is structured like this:

ID   |   DATE   |   ACTIVE   |  LEFT  |  NEW  |
123  |2021-01-01|      1     |    0   |   1   |
456  |2021-03-01|      1     |    0   |   1   |
456  |2021-06-01|      1     |    1   |   0   |
479  |2020-06-01|      1     |    1   |   0   |
567  |2021-07-01|      1     |    1   |   0   |

I want to implement a query in PySpark that returns all IDs that have both a row where "NEW == 1" and a row where "LEFT == 1", with the two conditions appearing in different rows.

So in this case, I'd like to return the following rows:

ID   |   DATE   |   ACTIVE   |  LEFT  |  NEW  |
456  |2021-03-01|      1     |    0   |   1   |
456  |2021-06-01|      1     |    1   |   0   |

Thanks in advance!

PS: the original dataset has over 13 million entries.

Upvotes: 1

Views: 109

Answers (1)

sushanth

Reputation: 8302

Here is a solution you can try: filter for the rows of interest, then group by ID to find IDs that appear more than once, and inner join the result back with the filtered DataFrame.

# Keep only the rows that satisfy either condition
df_filter = df.filter((df.LEFT == 1) | (df.NEW == 1))

df_filter.join(
    # IDs appearing more than once in the filtered set have both conditions
    df_filter.groupBy("ID").count().where("count > 1"), on=["ID"]
).drop("count").show()

+---+----------+------+----+---+
| ID|      DATE|ACTIVE|LEFT|NEW|
+---+----------+------+----+---+
|456|2021-03-01|     1|   0|  1|
|456|2021-06-01|     1|   1|  0|
+---+----------+------+----+---+

Upvotes: 1
