AliShahVis

Reputation: 141

groupby category and sum the count

Let's say I have a table (df) like so:

    type count
    A    5000
    B    5000
    C    200
    D    123
    ...  ...
    ...  ...
    Z    453

How can I sum the count column by type so that A and B keep their own rows, while all other types fall into an Other category?

I currently have this:

from pyspark.sql.functions import col, when
df = df.withColumn('type', when(~col("type").isin("A", "B"), "Other").otherwise(col("type")))

My expected output would be like so:

type  count
A     5000
B     5000
Other 3043
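
For anyone who wants to reproduce this, here is a minimal sketch of a toy DataFrame built from just the rows spelled out in the table above (the "..." rows are omitted, so the Other bucket comes to 776 here, which matches the output shown in both answers below):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Only the rows spelled out in the question's table; the elided rows are
# left out, so summing C, D and Z gives the 776 seen in the answers' output
df = spark.createDataFrame(
    [("A", 5000), ("B", 5000), ("C", 200), ("D", 123), ("Z", 453)],
    ["type", "count"],
)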

Upvotes: 0

Views: 1274

Answers (2)

mck

Reputation: 42352

You can split the dataframe into two parts based on the type, aggregate the sum for the second part, and combine them with unionAll.

import pyspark.sql.functions as F

# Keep the A and B rows as-is; collapse all remaining types
# into a single 'Other' row holding the summed count
result = df.filter("type in ('A', 'B')").unionAll(
    df.filter("type not in ('A', 'B')")
      .select(F.lit('Other'), F.sum('count'))
)

result.show()
+-----+-----+
| type|count|
+-----+-----+
|    A| 5000|
|    B| 5000|
|Other|  776|
+-----+-----+
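
A side note on why the select above unions cleanly: unionAll matches columns by position, not by name, so the differently named columns produced by the aggregation still line up with df's type and count. Adding aliases (a cosmetic variant, not part of the original answer) makes that explicit:

# Same plan as above, with aliases so both halves carry matching column names
other = (
    df.filter("type not in ('A', 'B')")
      .select(F.lit("Other").alias("type"), F.sum("count").alias("count"))
)
result = df.filter("type in ('A', 'B')").unionAll(other)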

Upvotes: 1

blackbishop

Reputation: 32670

You can group by a when expression and sum the count:

from pyspark.sql import functions as F

# Group by a when expression: A and B keep their own label,
# every other type is relabelled 'Others' before summing
df1 = df.groupBy(
    F.when(
        F.col("type").isin("A", "B"), F.col("type")
    ).otherwise("Others").alias("type")
).agg(
    F.sum("count").alias("count")
)

df1.show()

#+------+-----+
#|  type|count|
#+------+-----+
#|     B| 5000|
#|     A| 5000|
#|Others|  776|
#+------+-----+
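
One practical difference between the two answers: grouping by the when expression aggregates everything in a single pass over df, whereas the filter-and-union approach scans df twice. On a large table the single aggregation is usually the cheaper plan.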

Upvotes: 3
