Reputation: 77
I am trying to apply a groupBy on a set of IDs, each stamped with one or more (usually more) codes, all in the same column. I then want to keep only those IDs that have Code1 but not Code2.
The dataframe looks as follows:
| ID  | Code  |
| --- | ----- |
| 123 | Code1 |
| 123 | Code2 |
| 234 | Code1 |
| 567 | Code1 |
| 567 | Code2 |
I know this is feasible with a left_anti join as well, but I would also like to be able to do it with a groupBy statement.
Does anybody have any advice?
Upvotes: 0
Views: 24
Reputation: 42332
You can group by ID and collect a list of the Code values, then check for the presence of Code1 and the absence of Code2:
import pyspark.sql.functions as F

# For each ID, flag whether its codes include Code1 but not Code2
df2 = df.groupBy('ID').agg(
    (
        F.array_contains(F.collect_list('Code'), 'Code1') &
        ~F.array_contains(F.collect_list('Code'), 'Code2')
    ).alias('flag')
)
df2.show()
+---+-----+
| ID| flag|
+---+-----+
|234| true|
|567|false|
|123|false|
+---+-----+
df2.filter('flag').show()
+---+----+
| ID|flag|
+---+----+
|234|true|
+---+----+
Upvotes: 1