Reputation: 351
I have a dataframe df with a few columns. I want to groupby using one (or more) column and for every group, I want the count of values of another column(s).
Here's the df:
col1 col2 col3 col4
1 1 a 2
1 1 b 1
1 2 c 1
2 1 a 3
2 1 b 4
I want to groupby 'col1' and 'col2' and then for every group, the count of unique values in a column and then sum/mean/min/max of other column. I also want to maintain the grouped columns. the result should be:
col1 col2 count_a count_b count_c col4_sum
1 1 1 1 0 3
1 2 0 0 1 1
2 1 1 1 0 7
how do I achieve this?
Upvotes: 3
Views: 2519
Reputation: 5078
You have two solutions
First, you can use pivot on col3
to get your count of unique values, and then join your pivoted dataframe with an aggregated dataframe that compute the sum/mean/min/max of other column.
Your code would be as follows:
from pyspark.sql import functions as F
result = df \
.groupBy('col1', 'col2') \
.pivot('col3') \
.agg(F.count('col3')) \
.fillna(0) \
.join(
df.groupby('col1', 'col2').agg(F.sum('col4').alias('col4_sum')),
['col1', 'col2']
)
And with your input dataframe, you will get:
+----+----+---+---+---+--------+
|col1|col2|a |b |c |col4_sum|
+----+----+---+---+---+--------+
|1 |1 |1 |1 |0 |3 |
|1 |2 |0 |0 |1 |1 |
|2 |1 |1 |1 |0 |7 |
+----+----+---+---+---+--------+
However, you can't choose the name of columns extracted from pivot, it will be the name of the value.
If you really want to choose the name of the columns, you can retrieve all distinct values first and then build your aggregation column from each of them, as follows:
from pyspark.sql import functions as F
values = map(lambda x: x.col3, df.select("col3").distinct().collect())
count_of_distinct_values = [F.sum((F.col('col3') == i).cast('integer')).alias('count_' + i) for i in values]
other_column_aggregations = [F.sum('col4').alias('col4_sum')]
aggregated = count_of_distinct_values + other_column_aggregations
result = df.groupBy('col1', 'col2').agg(*aggregated)
You then get the following dataframe:
+----+----+-------+-------+-------+--------+
|col1|col2|count_a|count_b|count_c|col4_sum|
+----+----+-------+-------+-------+--------+
|1 |1 |1 |1 |0 |3 |
|1 |2 |0 |0 |1 |1 |
|2 |1 |1 |1 |0 |7 |
+----+----+-------+-------+-------+--------+
Upvotes: 2