Reputation: 865
Let's say I have a PySpark DataFrame like the following:
df = spark.createDataFrame([
    ("Jan", "a", "apple", 1),
    ("Jan", "a", "pear", 2),
    ("Jan", "b", "apple", 1),
    ("Feb", "c", "carrot", 3),
    ("Feb", "c", "apple", 1),
], ["month", "id", "product", "price"])
+-----+---+-------+-----+
|month| id|product|price|
+-----+---+-------+-----+
| Jan| a| apple| 1|
| Jan| a| pear| 2|
| Jan| b| apple| 1|
| Feb| c| carrot| 3|
| Feb| c| apple| 1|
+-----+---+-------+-----+
Now I want to group by month and product and aggregate the sum of the percentage of spending per customer. The desired output would look like this:
+-----+-------+-------+
|month|product|agg_val|
+-----+-------+-------+
| Jan| apple| 1.33|
| Jan| pear| 0.66|
| Feb| carrot| 0.75|
| Feb| apple| 0.25|
+-----+-------+-------+
where the agg_val values are calculated as 1.33 = 1/(1+2) + 1/1, 0.66 = 2/(1+2), 0.75 = 3/(3+1) and 0.25 = 1/(3+1).
Is there a way to do that within a .agg()? I can't figure out how.
Upvotes: 0
Views: 993
Reputation: 42352
You can add a column with each row's percentage of that customer's monthly spending (using a window partitioned by id and month), and then aggregate:
from pyspark.sql import functions as F, Window
df2 = df.withColumn(
    'total',
    # price divided by the customer's total spending for that month
    F.col('price') / F.sum('price').over(Window.partitionBy('id', 'month'))
).groupBy('month', 'product').agg(F.sum('total').alias('total'))
df2.show()
+-----+-------+------------------+
|month|product| total|
+-----+-------+------------------+
| Jan| apple|1.3333333333333333|
| Jan| pear|0.6666666666666666|
| Feb| apple| 0.25|
| Feb| carrot| 0.75|
+-----+-------+------------------+
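If you want two-decimal values like in your desired output, one option (a small sketch building on the df2 above) is to round the aggregated column; note that rounding gives 0.67 rather than the truncated 0.66:

from pyspark.sql import functions as F

# round the aggregated percentages to two decimals
df2.withColumn('total', F.round('total', 2)).show()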
Upvotes: 1