Reputation: 117
I have a DataFrame that is structured like this:
 ID |    DATE    | ACTIVE | LEFT | NEW
123 | 2021-01-01 |      1 |    0 |   1
456 | 2021-03-01 |      1 |    0 |   1
456 | 2021-06-01 |      1 |    1 |   0
479 | 2020-06-01 |      1 |    1 |   0
567 | 2021-07-01 |      1 |    1 |   0
I want to implement a query in PySpark that returns all IDs that have both a row where NEW == 1 and a row where LEFT == 1, i.e. the two conditions are satisfied in different rows.
So in this case, I'd like to return the following rows:
 ID |    DATE    | ACTIVE | LEFT | NEW
456 | 2021-03-01 |      1 |    0 |   1
456 | 2021-06-01 |      1 |    1 |   0
Thanks in advance!
PS: The original dataset has over 13 million entries.
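For reference, a minimal sketch of how the sample above could be reproduced (assuming an existing SparkSession named spark):

df = spark.createDataFrame(
    [
        (123, "2021-01-01", 1, 0, 1),
        (456, "2021-03-01", 1, 0, 1),
        (456, "2021-06-01", 1, 1, 0),
        (479, "2020-06-01", 1, 1, 0),
        (567, "2021-07-01", 1, 1, 0),
    ],
    ["ID", "DATE", "ACTIVE", "LEFT", "NEW"],
)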
Upvotes: 1
Views: 109
Reputation: 8302
Here is a solution you can try: apply the filter, then group by ID to identify IDs that appear more than once, and inner join the result back with the filtered DataFrame.
# Keep only rows where either condition holds
df_filter = df.filter((df.LEFT == 1) | (df.NEW == 1))

df_filter.join(
    # Identify IDs that appear in more than one filtered row
    df_filter.groupBy("ID").count().where("count > 1"),
    on=["ID"],
).drop("count").show()
+---+----------+------+----+---+
| ID| DATE|ACTIVE|LEFT|NEW|
+---+----------+------+----+---+
|456|2021-03-01| 1| 0| 1|
|456|2021-06-01| 1| 1| 0|
+---+----------+------+----+---+
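Note that count > 1 also matches an ID with, say, two LEFT == 1 rows and no NEW == 1 row. If that case can occur in your data, a stricter variant (a sketch using the same column names) aggregates both flags per ID before joining:

from pyspark.sql import functions as F

# IDs that have at least one NEW == 1 row AND at least one LEFT == 1 row
both_ids = (
    df.groupBy("ID")
    .agg(F.max("NEW").alias("has_new"), F.max("LEFT").alias("has_left"))
    .where((F.col("has_new") == 1) & (F.col("has_left") == 1))
    .select("ID")
)

# Join back, keeping only the rows that carry one of the two conditions
df.join(both_ids, on="ID").filter((df.NEW == 1) | (df.LEFT == 1)).show()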
Upvotes: 1