rina

Reputation: 15

Adding a total count column in PySpark

I have a dataset like this:

+--------------+---------+
|flavor        |type     |
+--------------+---------+
|chocolate     |ice cream|
|strawberry    |cake     |
|chocolate     |cake     |
|chocolate     |ice cream|
|strawberry    |ice cream|
+--------------+---------+
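(For reference, a minimal sketch to rebuild this sample data, assuming an active SparkSession named spark:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data matching the table above
df = spark.createDataFrame(
    [("chocolate", "ice cream"),
     ("strawberry", "cake"),
     ("chocolate", "cake"),
     ("chocolate", "ice cream"),
     ("strawberry", "ice cream")],
    ["flavor", "type"],
)
)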

I need to create another column("total") with the total number of

  • flavors regardless of the type and
  • flavor and type combined

All of this in the same table which will have 3 columns like below:

+--------------+---------+-----+
|flavor        |type     |total|
+--------------+---------+-----+
|chocolate     |ice cream|    2|
|chocolate     |cake     |    1|
|chocolate     |null     |    3|
|strawberry    |cake     |    1|
|strawberry    |ice cream|    1|
|strawberry    |null     |    2|
+--------------+---------+-----+

I tried this code df.groupBy("flavor,"type").agg(F.count(df['flavor']).alias("total"))

Upvotes: 0

Views: 66

Answers (1)

SMaZ

Reputation: 2655

You can use rollup:

df.rollup('flavor', 'type').count() \
    .where('flavor is not null or type is not null') \
    .selectExpr('flavor', 'type', 'count as total') \
    .show()
# The where clause removes the grand-total row (count=5, where flavor and type are both null)

+----------+---------+-----+
|    flavor|     type|total|
+----------+---------+-----+
| chocolate|     cake|    1|
| chocolate|     null|    3|
|strawberry|     cake|    1|
| chocolate|ice cream|    2|
|strawberry|     null|    2|
|strawberry|ice cream|    1|
+----------+---------+-----+
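One caveat: if the real data could itself contain null flavor or type values, the is not null filter would also drop those genuine rows. A hedged variant (a sketch, assuming F is pyspark.sql.functions) tags the rollup-generated rows with grouping_id and filters on that instead:

from pyspark.sql import functions as F

# gid is 0 for plain (flavor, type) rows, 1 when type is rolled up,
# and 3 for the grand-total row where both columns are rolled up
df.rollup('flavor', 'type') \
    .agg(F.count('*').alias('total'), F.grouping_id().alias('gid')) \
    .where('gid < 3') \
    .select('flavor', 'type', 'total') \
    .show()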

Upvotes: 1
