Reputation: 141
Let's say I have a table (df) like so:
type   count
A       5000
B       5000
C        200
D        123
...      ...
...      ...
Z        453
How can I sum the count column grouped by types A and B, with all other types falling into an Others category?
I currently have this incomplete attempt (isnot is not a valid Column method, so it doesn't run):
df = df.withColumn('type', when(col("type").isnot("A", "B"))
My expected output would be like so:
type    count
A        5000
B        5000
Other    3043
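For anyone who wants to reproduce this, a minimal sketch of the sample data; the SparkSession setup and the exact rows are assumptions based on the table above (the "..." rows are omitted):
from pyspark.sql import SparkSession

# Hypothetical sample data matching only the visible rows of the question.
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("A", 5000), ("B", 5000), ("C", 200), ("D", 123), ("Z", 453)],
    ["type", "count"],
)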
Upvotes: 0
Views: 1274
Reputation: 42352
You can divide the dataframe into two parts based on the type, aggregate a sum for the second part, and use unionAll to combine them:
import pyspark.sql.functions as F

# Keep the A and B rows as they are, then append one aggregated row
# covering everything else; unionAll matches columns by position.
result = df.filter("type in ('A', 'B')").unionAll(
    df.filter("type not in ('A', 'B')")
      .select(F.lit('Other').alias('type'), F.sum('count').alias('count'))
)
result.show()
+-----+-----+
| type|count|
+-----+-----+
| A| 5000|
| B| 5000|
|Other| 776|
+-----+-----+
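Note that unionAll pairs columns by position rather than by name (the aliases above just keep the schema tidy). If you prefer an explicit match by name, here is a minimal sketch using unionByName (available since Spark 2.3); this variant is my illustration, not part of the original answer:
import pyspark.sql.functions as F

# Same split-and-append idea, but matching columns by name.
result = df.filter(F.col("type").isin("A", "B")).unionByName(
    df.filter(~F.col("type").isin("A", "B"))
      .select(F.lit("Other").alias("type"), F.sum("count").alias("count"))
)
result.show()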
Upvotes: 1
Reputation: 32670
You can group by a when expression that maps A and B to themselves and everything else to Others, then sum the count:
from pyspark.sql import functions as F

# Group by the mapped type (A, B, or Others) and sum the counts.
df1 = df.groupBy(
    F.when(
        F.col("type").isin("A", "B"), F.col("type")
    ).otherwise("Others").alias("type")
).agg(
    F.sum("count").alias("count")
)
df1.show()
#+------+-----+
#| type|count|
#+------+-----+
#| B| 5000|
#| A| 5000|
#|Others| 776|
#+------+-----+
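If you would rather stay close to the withColumn attempt from the question, here is a minimal sketch of the equivalent two-step version (relabel first, then aggregate); this variant is my illustration, not part of the original answer:
import pyspark.sql.functions as F

# Relabel every type other than A and B to "Others", then sum per type.
# This gives the same result as grouping by the when expression directly.
df2 = (
    df.withColumn(
        "type",
        F.when(F.col("type").isin("A", "B"), F.col("type")).otherwise("Others"),
    )
    .groupBy("type")
    .agg(F.sum("count").alias("count"))
)
df2.show()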
Upvotes: 3