Reputation: 593
I have a pyspark dataframe df of records. Each record has an id and a group, and marks whether two events (event1, event2) have occurred. For each group, I want to count the ids that fall into certain combinations of the two events (e.g. event2 but not event1, or both events).
Here is a simple example:
df:
| id  | event1 | event2 | group |
| 001 | 1      | 0      | A     |
| 001 | 1      | 0      | A     |
| 001 | 1      | 1      | A     |
| 002 | 0      | 1      | A     |
| 003 | 1      | 0      | A     |
| 003 | 1      | 1      | A     |
| ... | ...    | ...    | B     |
...
In the above df, for group = A, 2 ids have event1 (001, 003) and 3 ids have event2 (001, 002, 003). So, for example, the number of ids with event2 but not event1 is 1.
I hope to get something like this:
| group | event2_not_1 | event1_and_2 |
| A     | 1            | 2            |
| B     | ...          | ...          |
So far I have tried collecting, per group, the set of ids that appeared for each event, and then performing set operations on those columns of df_new separately. But this feels rather clumsy, e.g.:
from pyspark.sql.functions import col, collect_set, when

# collect, per group, the set of ids that have each event
df_new = (
    df.withColumn('event1_id', when(col('event1') == 1, col('id')))
      .withColumn('event2_id', when(col('event2') == 1, col('id')))
      .groupby('group')
      .agg(collect_set('event1_id').alias('has_event1'),
           collect_set('event2_id').alias('has_event2'))
)
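For what it's worth, I could probably finish this with the array functions available since Spark 2.4, roughly like the sketch below (counts is just a placeholder name):
counts = df_new.select(
    'group',
    # ids with event2 but not event1
    size(array_except(col('has_event2'), col('has_event1'))).alias('event2_not_1'),
    # ids with both events
    size(array_intersect(col('has_event1'), col('has_event2'))).alias('event1_and_2'),
)
(with array_except, array_intersect, size also imported from pyspark.sql.functions.)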
How do I achieve this elegantly in pyspark?
Upvotes: 0
Views: 211
Reputation: 13551
Use groupBy twice.
df.groupBy("group", "id").agg(f.max("event1").alias("event1"), f.max("event2").alias("event2")) \
.groupBy("group").agg(f.sum(f.expr("if(event2 = 1 and event1 = 0, 1, 0)")).alias("event2_not_1"), f.sum(f.expr("if(event1 = 1 and event2 = 1, 1, 0)")).alias("event1_and_2"))
+-----+------------+------------+
|group|event2_not_1|event1_and_2|
+-----+------------+------------+
|A |1 |2 |
+-----+------------+------------+
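The first groupBy collapses the data to one row per (group, id) using max, so each id is counted once no matter how many rows it has; the second groupBy then counts the event combinations per group. If you prefer the column API over SQL expressions, the second aggregation could be written with f.when instead (a sketch of the same logic, result is just an illustrative name):
per_id = df.groupBy("group", "id").agg(
    f.max("event1").alias("event1"),
    f.max("event2").alias("event2"),
)

result = per_id.groupBy("group").agg(
    f.sum(f.when((f.col("event2") == 1) & (f.col("event1") == 0), 1).otherwise(0)).alias("event2_not_1"),
    f.sum(f.when((f.col("event1") == 1) & (f.col("event2") == 1), 1).otherwise(0)).alias("event1_and_2"),
)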
Upvotes: 0