Reputation: 77
I am trying to apply a groupBy on a set of IDs, each stamped with one or more (usually more) codes, all in the same column. I then want to keep only those IDs that have Code1 but not Code2.
The dataframe looks as follows:
| ID  | Code  |
| --- | ----- |
| 123 | Code1 |
| 123 | Code2 |
| 234 | Code1 |
| 567 | Code1 |
| 567 | Code2 |
I know this is feasible with a left_anti join as well, but I would also like to be able to do it with a groupBy statement.
Does anybody have any advice?
Upvotes: 0
Views: 24
Reputation: 42332
You can group by ID and collect a list of the Code values, then check for the presence of Code1 and the absence of Code2:
import pyspark.sql.functions as F

# For each ID, flag whether its codes include Code1 but not Code2
df2 = df.groupBy('ID').agg(
    (
        F.array_contains(F.collect_list('Code'), 'Code1') &
        ~F.array_contains(F.collect_list('Code'), 'Code2')
    ).alias('flag')
)
df2.show()
+---+-----+
| ID| flag|
+---+-----+
|234| true|
|567|false|
|123|false|
+---+-----+
df2.filter('flag').show()
+---+----+
| ID|flag|
+---+----+
|234|true|
+---+----+
Upvotes: 1