Reputation: 5480
I have a DataFrame in PySpark like the one below. I want to group by the category
column and count the records in each category.
df.show()
+--------+----+
|category| val|
+--------+----+
| cat1| 13|
| cat2| 12|
| cat2| 14|
| cat3| 23|
| cat1| 20|
| cat1| 10|
| cat2| 30|
| cat3| 11|
| cat1| 7|
| cat1| 8|
+--------+----+
res = df.groupBy('category').count()
res.show()
+--------+-----+
|category|count|
+--------+-----+
| cat2| 3|
| cat3| 2|
| cat1| 5|
+--------+-----+
I am getting my desired result. Now I want to calculate the average count per category.
The DataFrame has records for 3 days, so I basically want to divide each count by the
number of days (count / no. of days). The result I want is below.
+--------+-----+
|category|count|
+--------+-----+
| cat2| 1|
| cat3| 1|
| cat1| 2|
+--------+-----+
How can I do that?
Upvotes: 2
Views: 5055
Reputation: 855
I believe what you want is:

from pyspark.sql import functions as F

# count the rows per category, then divide by the number of days (3)
df.groupBy('category').agg((F.count('val') / 3).alias('average')).show()
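If you would rather not hard-code the number of days, one option is to derive it from a date column. This is a sketch that assumes the original DataFrame has a column named day (not shown in the question), counts its distinct values with countDistinct, and uses that as the divisor:

from pyspark.sql import functions as F

# number of distinct days in the data
# (assumes a 'day' column, which is hypothetical here)
n_days = df.select(F.countDistinct('day')).collect()[0][0]

res = (df.groupBy('category')
         .agg((F.count('val') / n_days).alias('average')))
res.show()

To get the whole numbers shown in the desired output, wrap the division in F.round, e.g. F.round(F.count('val') / n_days).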
Upvotes: 3