TiTo

Reputation: 865

Groupby two columns and aggregate as percent of one of the columns

Let's say I have a PySpark DataFrame like the following:

df = spark.createDataFrame([
  ("Jan", "a", "apple", 1),
  ("Jan", "a", "pear", 2),
  ("Jan", "b", "apple", 1),
  ("Feb", "c", "carrot", 3),
  ("Feb", "c", "apple", 1),
], ["month", id", "product", "price"])

+-----+---+-------+-----+
|month| id|product|price|
+-----+---+-------+-----+
|  Jan|  a|  apple|    1|
|  Jan|  a|   pear|    2|
|  Jan|  b|  apple|    1|
|  Feb|  c| carrot|    3|
|  Feb|  c|  apple|    1|
+-----+---+-------+-----+

Now I want to group by month and product and, for each group, sum up each customer's share of their monthly spending. The desired output would look like this:

+-----+-------+-------+
|month|product|agg_val|
+-----+-------+-------+
|  Jan|  apple|   1.33|
|  Jan|   pear|   0.66|
|  Feb| carrot|   0.75|
|  Feb|  apple|   0.25|
+-----+-------+-------+

where the agg_val values are calculated as 1.33 = 1/(1+2) + 1/1, 0.66 = 2/(1+2), 0.75 = 3/(3+1), and 0.25 = 1/(3+1).
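To spell the arithmetic out, here is the same calculation in plain Python (values taken directly from the example data; the variable names are just for illustration):

jan_a = 1 + 2   # customer a's total spend in Jan
jan_b = 1       # customer b's total spend in Jan
feb_c = 3 + 1   # customer c's total spend in Feb

apple_jan  = 1 / jan_a + 1 / jan_b   # 1.333...
pear_jan   = 2 / jan_a               # 0.666...
carrot_feb = 3 / feb_c               # 0.75
apple_feb  = 1 / feb_c               # 0.25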

Is there a way to do that within a .agg()? I can't figure out how.

Upvotes: 0

Views: 993

Answers (1)

mck

Reputation: 42352

You can add a column with each row's share of the customer's spending for the month (using a window function), and then do the aggregation:

from pyspark.sql import functions as F, Window

# each row's price divided by that customer's total spend in the month,
# then summed per (month, product)
df2 = df.withColumn(
    'total', 
    F.col('price') / F.sum('price').over(Window.partitionBy('id', 'month'))
).groupBy('month', 'product').agg(F.sum('total').alias('total'))

df2.show()
+-----+-------+------------------+
|month|product|             total|
+-----+-------+------------------+
|  Jan|  apple|1.3333333333333333|
|  Jan|   pear|0.6666666666666666|
|  Feb|  apple|              0.25|
|  Feb| carrot|              0.75|
+-----+-------+------------------+
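
If you want the two-decimal values shown in the question, you could round the aggregated column afterwards; a minimal follow-up sketch using the df2 above (F.round is the standard PySpark rounding function):

# round the aggregated share to two decimals to match the desired output
df2.withColumn('total', F.round('total', 2)).show()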

Upvotes: 1
