yuser099881232

Reputation: 351

PySpark: get value counts within a groupby

I have a dataframe df with a few columns. I want to group by one (or more) of the columns, and for every group I want the value counts of another column (or columns).

Here's the df:

col1  col2  col3  col4
1     1     a     2
1     1     b     1
1     2     c     1
2     1     a     3
2     1     b     4
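
For reproducibility, here is a minimal sketch that builds this dataframe (assuming an existing SparkSession named spark):

# minimal reproducible setup, assuming an existing SparkSession named spark
df = spark.createDataFrame(
    [(1, 1, 'a', 2), (1, 1, 'b', 1), (1, 2, 'c', 1), (2, 1, 'a', 3), (2, 1, 'b', 4)],
    ['col1', 'col2', 'col3', 'col4'],
)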

I want to group by 'col1' and 'col2', and then, for every group, get the count of each distinct value of one column plus the sum/mean/min/max of another column. I also want to keep the grouping columns in the result. The result should be:

col1  col2  count_a  count_b  count_c  col4_sum
1     1     1        1        0        3
1     2     0        0        1        1
2     1     1        1        0        7

How do I achieve this?

Upvotes: 3

Views: 2519

Answers (1)

Vincent Doba

Reputation: 5078

You have two solutions.

First, you can use pivot on col3 to get your value counts, and then join the pivoted dataframe with an aggregated dataframe that computes the sum/mean/min/max of the other column.

Your code would be as follows:

from pyspark.sql import functions as F

# one column per distinct value of col3, holding that value's count
# within each (col1, col2) group
result = (
    df.groupBy('col1', 'col2')
    .pivot('col3')
    .agg(F.count('col3'))
    .fillna(0)  # pivot leaves null where a value never occurs in a group
    .join(
        # sum of col4 per group, joined back on the grouping columns
        df.groupBy('col1', 'col2').agg(F.sum('col4').alias('col4_sum')),
        ['col1', 'col2']
    )
)

And with your input dataframe, you will get:

+----+----+---+---+---+--------+
|col1|col2|a  |b  |c  |col4_sum|
+----+----+---+---+---+--------+
|1   |1   |1  |1  |0  |3       |
|1   |2   |0  |0  |1  |1       |
|2   |1   |1  |1  |0  |7       |
+----+----+---+---+---+--------+
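
As a side note, if you already know the distinct values of col3, you can pass them to pivot explicitly: Spark then skips the extra job it would otherwise run to compute them, and the column order becomes deterministic. A minimal sketch, hardcoding the values from the sample data:

# passing the pivot values up front avoids a separate job to discover
# them and fixes the output column order ('a', 'b', 'c' are hardcoded
# from the sample data)
pivoted = (
    df.groupBy('col1', 'col2')
    .pivot('col3', ['a', 'b', 'c'])
    .agg(F.count('col3'))
    .fillna(0)
)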

However, you can't choose the names of the columns produced by pivot: each column is simply named after its value.
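
If renaming after the fact is enough for you, one possible workaround (a sketch against the result dataframe built above) is to alias the pivoted columns in a select:

# alias every pivoted column, leaving the grouping columns and
# col4_sum untouched ('result' is the dataframe from the snippet above)
keep = {'col1', 'col2', 'col4_sum'}
renamed = result.select(
    *[F.col(c) if c in keep else F.col(c).alias('count_' + c) for c in result.columns]
)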

If you want to choose the column names directly in the aggregation, you can retrieve all distinct values of col3 first and then build one aggregation column from each of them, as follows:

from pyspark.sql import functions as F

# collect the distinct values of col3 to the driver
values = [row.col3 for row in df.select('col3').distinct().collect()]

# one conditional count per distinct value, with a chosen column name
count_of_distinct_values = [
    F.sum((F.col('col3') == v).cast('integer')).alias('count_' + v)
    for v in values
]
other_column_aggregations = [F.sum('col4').alias('col4_sum')]
aggregated = count_of_distinct_values + other_column_aggregations

result = df.groupBy('col1', 'col2').agg(*aggregated)
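
As a side note, an equivalent way to write each conditional count (purely a matter of taste) is F.count over F.when, since count skips the nulls that when produces for non-matching rows. A sketch, reusing the values list from above:

# equivalent to the sum-of-casts above: when() yields null for
# non-matching rows, and count() ignores nulls
count_of_distinct_values = [
    F.count(F.when(F.col('col3') == v, True)).alias('count_' + v)
    for v in values
]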

You then get the following dataframe:

+----+----+-------+-------+-------+--------+
|col1|col2|count_a|count_b|count_c|col4_sum|
+----+----+-------+-------+-------+--------+
|1   |1   |1      |1      |0      |3       |
|1   |2   |0      |0      |1      |1       |
|2   |1   |1      |1      |0      |7       |
+----+----+-------+-------+-------+--------+

Upvotes: 2
