Reputation: 15
I have dataset like this
+--------------+---------+
|flavor |type |
+--------------+---------+
|chocolate |ice cream|
|strawberry | cake |
|chocolate | cake |
|chocolate |ice cream|
|strawberry |ice cream|
+--------------+---------+
I need to create another column("total") with the total number of
- flavors regardless of the type and
- flavor and type combined
All of this in the same table which will have 3 columns like below:
+--------------+---------+--------+
|flavor |type |total |
+--------------+---------+--------+
|chocolate |ice cream| 2 |
|chocolate | cake | 1 |
|chocolate | null | 3 |
|strawberry | cake | 1 |
|strawberry |ice cream| 1 |
|strawberry | null | 2 |
+--------------+---------+--------+
I tried this code df.groupBy("flavor,"type").agg(F.count(df['flavor']).alias("total"))
Upvotes: 0
Views: 66
Reputation: 2655
You can use rollup
df.rollup('flavor','type').count() \
.where('flavor is not null or type is not null') \
.selectExpr('flavor', 'type', 'count as total') \
.show()
# Where cluase is to remove total count=5 (flavor and type both null)
+----------+--------+-----+
| flavor| type|total|
+----------+--------+-----+
| chocolate| cake| 1|
| chocolate| null| 3|
|strawberry| cake| 1|
| chocolate|icecream| 2|
|strawberry| null| 2|
|strawberry|icecream| 1|
+----------+--------+-----+
Upvotes: 1