cs_guy
cs_guy

Reputation: 373

Conditionally counting from a column

I have some data stored in df1:

| id1 | id2      | product |
|-----|----------|---------|
| 1   | abc-2323 | Upload  |
| 1   | 234234   | Upload  |
| 1   | 43322    | Upload  |
| 2   | abc-449  | Upload  |
| 3   | abc-495  | Upload  |
| 3   | 432      | Upload  |
| 3   | 6543     | Upload  |

Per id1 and product, I want to count the distinct id2s based on weather they begin with abc- or not. I have tried a conditional aggregation like this:

agg_data = (
    df1
    .groupby('id1', 'product')
    .agg(
       sf.when(~ sf.col('id2').like('abc-%'), sf.countDistinct('id2')).alias('id2_count_without_abc'),
       sf.when(sf.col('id2').like('abc-%'), sf.countDistinct('id2')).alias('is2_count_with_abc')
    )
)

However this errors out that id2 is neither present in the group by, nor is it an aggregate function. I'm unsure why I cannot do this conditional since simply sf.countDistinct('id2') would work by itself. E.g. No problem with this:

agg_data = (
    df1
    .groupby('id1', 'product')
    .agg(
       sf.countDistinct('id2').alias('id2_count_without_abc'),
       sf.countDistinct('id2').alias('is2_count_with_abc')
    )
)

Upvotes: 4

Views: 61

Answers (1)

Mohana B C
Mohana B C

Reputation: 5487

When you do groupBy you will get many rows per group. Since you can get one value per group for countDistinct(), you won't get error. But for when clause you can pass one value at a time but group will return multiple rows.

To get rid of this error you can select one value using first() or last() (This will return first/last value of corresponding group) and pass that to when() function.

But you need all the values of group for count calculations based on one condition. Below code should do the trick.

df1.groupby('id1', 'product') \
.agg(sf.collect_list('id2').alias('id2')) \
.withColumn('id2_count_with_abc', sf.size(sf.expr("filter(id2, i-> i like 'abc-%')"))) \
.withColumn('id2_count_without_abc', sf.abs(sf.size(sf.col('id2')) - sf.col('id2_count_with_abc'))) \
.drop('id2').show()


+---+-------+------------------+---------------------+
|id1|product|id2_count_with_abc|id2_count_without_abc|
+---+-------+------------------+---------------------+
|  2| Upload|                 1|                    0|
|  1| Upload|                 1|                    2|
|  3| Upload|                 1|                    2|
+---+-------+------------------+---------------------+

Upvotes: 1

Related Questions