vishwajeet

Reputation: 83

How to find the sum and count of duplicate values in PySpark?

I have a DataFrame dd1:

colA    colB    Total   
 A       A        12
 A       A         1
 B       B        45
 B       B         0
 B       B         5
 C       C         1
 D       D         12

and I want output like this as dd2:

colA    colB    count  Total   
 A       A        2      13
 B       B        3      50
 C       C        1       1
 D       D        1      12

In the count column the value is how many times each (colA, colB) pair occurred, and the Total column contains the sum of Total for that pair.

Upvotes: 1

Views: 2144

Answers (1)

Shantanu Sharma

Reputation: 4089

Try this -

from pyspark.sql import functions as F

# Group by both key columns, then count the rows and sum Total per group
dd2 = dd1.groupBy('colA', 'colB').agg(F.count('colA').alias('count'), F.sum('Total').alias('Total'))
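To see what the groupBy/agg computes without a Spark cluster, here is a plain-Python sketch of the same aggregation over the question's sample rows (the rows list and variable names are my own, not from the question):

```python
from collections import defaultdict

# Sample rows from the question: (colA, colB, Total)
rows = [
    ("A", "A", 12), ("A", "A", 1),
    ("B", "B", 45), ("B", "B", 0), ("B", "B", 5),
    ("C", "C", 1), ("D", "D", 12),
]

# Mimic groupBy('colA', 'colB').agg(count, sum): one bucket per key pair
agg = defaultdict(lambda: [0, 0])  # (colA, colB) -> [count, total]
for a, b, total in rows:
    agg[(a, b)][0] += 1
    agg[(a, b)][1] += total

dd2 = [(a, b, c, t) for (a, b), (c, t) in sorted(agg.items())]
print(dd2)
# [('A', 'A', 2, 13), ('B', 'B', 3, 50), ('C', 'C', 1, 1), ('D', 'D', 1, 12)]
```

This reproduces the expected dd2 table, which is a quick way to sanity-check the Spark result on small data.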

Upvotes: 1
