Reputation: 141
Let's say I have a table (df) like so:
type   count
A       5000
B       5000
C        200
D        123
...      ...
...      ...
Z        453
How can I sum the count column grouped by types A and B, with all other types falling into an Others category?
I currently have this incomplete attempt (isnot is not a valid Column method, so it doesn't run):
df = df.withColumn('type', when(col("type").isnot("A", "B"))
My expected output would be like so:
type    count
A        5000
B        5000
Other    3043
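For anyone who wants to reproduce this, a minimal sketch of the sample data; the SparkSession setup and the exact rows are assumptions based on the table above (the "..." rows are omitted):
from pyspark.sql import SparkSession

# Hypothetical sample data matching only the visible rows of the question.
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("A", 5000), ("B", 5000), ("C", 200), ("D", 123), ("Z", 453)],
    ["type", "count"],
)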
Upvotes: 0
Views: 1274
Reputation: 42352
You can divide the dataframe into two parts based on the type, aggregate a sum for the second part, and use unionAll to combine them:
import pyspark.sql.functions as F

# Keep the A and B rows as they are, then append one aggregated row
# covering everything else; unionAll matches columns by position.
result = df.filter("type in ('A', 'B')").unionAll(
    df.filter("type not in ('A', 'B')")
      .select(F.lit('Other').alias('type'), F.sum('count').alias('count'))
)
result.show()
+-----+-----+
| type|count|
+-----+-----+
| A| 5000|
| B| 5000|
|Other| 776|
+-----+-----+
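Note that unionAll pairs columns by position rather than by name (the aliases above just keep the schema tidy). If you prefer an explicit match by name, here is a minimal sketch using unionByName (available since Spark 2.3); this variant is my illustration, not part of the original answer:
import pyspark.sql.functions as F

# Same split-and-append idea, but matching columns by name.
result = df.filter(F.col("type").isin("A", "B")).unionByName(
    df.filter(~F.col("type").isin("A", "B"))
      .select(F.lit("Other").alias("type"), F.sum("count").alias("count"))
)
result.show()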
Upvotes: 1
Reputation: 32670
You can group by a when expression that maps A and B to themselves and everything else to Others, then sum the count:
from pyspark.sql import functions as F

# Group by the mapped type (A, B, or Others) and sum the counts.
df1 = df.groupBy(
    F.when(
        F.col("type").isin("A", "B"), F.col("type")
    ).otherwise("Others").alias("type")
).agg(
    F.sum("count").alias("count")
)
df1.show()
#+------+-----+
#| type|count|
#+------+-----+
#| B| 5000|
#| A| 5000|
#|Others| 776|
#+------+-----+
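If you would rather stay close to the withColumn attempt from the question, here is a minimal sketch of the equivalent two-step version (relabel first, then aggregate); this variant is my illustration, not part of the original answer:
import pyspark.sql.functions as F

# Relabel every type other than A and B to "Others", then sum per type.
# This gives the same result as grouping by the when expression directly.
df2 = (
    df.withColumn(
        "type",
        F.when(F.col("type").isin("A", "B"), F.col("type")).otherwise("Others"),
    )
    .groupBy("type")
    .agg(F.sum("count").alias("count"))
)
df2.show()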
Upvotes: 3