mojek

Reputation: 307

Average per group in PySpark

I have the PySpark dataframe below:

cust | amount
-------------
A    | 5
A    | 1
A    | 3
B    | 4
B    | 4
B    | 2
C    | 2
C    | 1
C    | 7
C    | 5

I need to group by the column 'cust' and calculate the average per group.

Expected result:

cust | avg_amount
-----------------
A    | 3
B    | 3.333
C    | 3.75

I've been using the code below, but it gives me an error.

data.withColumn("avg_amount", F.avg("amount"))

Any idea how I can compute this average?

Upvotes: 0

Views: 134

Answers (1)

blackbishop

Reputation: 32660

Use groupBy to compute the number of transactions and the average amount per customer:

from pyspark.sql import functions as F

data = data.groupBy("cust") \
           .agg(
               F.count("*").alias("amount"),        # number of rows per customer
               F.avg("amount").alias("avg_amount")  # average of the amount column
           )
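If you instead want to keep every row and just attach the per-customer average as an extra column (closer to your original withColumn attempt), an aggregate over a window should work. A minimal sketch, assuming data is the DataFrame from the question:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Window partitioned by 'cust': F.avg is computed per customer,
# but every original row is kept and annotated with its group's average.
w = Window.partitionBy("cust")
data_with_avg = data.withColumn("avg_amount", F.avg("amount").over(w))

This is why the plain data.withColumn("avg_amount", F.avg("amount")) fails: an aggregate function needs either a groupBy or a window to define which rows it averages over.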

Upvotes: 1
