How to write a custom aggregate function in pandas that transforms a series?

I have a dataframe like this:

import pandas as pd
df = pd.DataFrame({'item_id':[1,2,3,4,5,6,7,8,9,10], 'category':['A', 'B', 'A', 'C', 'B', 'B', 'C', 'A', 'A', 'C'], 'sales': [100, 150, 300, 1000, 300, 50, 1000, 600, 700, 100]})

   item_id category  sales
0        1        A    100
1        2        B    150
2        3        A    300
3        4        C   1000
4        5        B    300
5        6        B     50
6        7        C   1000
7        8        A    600
8        9        A    700
9       10        C    100

and I want the cumulative percent of total sales for every item, from most sold to least sold, like this:

df = df.sort_values(by = 'sales', ascending = False)
df['pct_of_total'] = df['sales']/df['sales'].sum()
df['cumsum_pct_of_total'] = df['pct_of_total'].cumsum()

   item_id category  sales  pct_of_total  cumsum_pct_of_total
3        4        C   1000      0.232558             0.232558
6        7        C   1000      0.232558             0.465116
8        9        A    700      0.162791             0.627907
7        8        A    600      0.139535             0.767442
2        3        A    300      0.069767             0.837209
4        5        B    300      0.069767             0.906977
1        2        B    150      0.034884             0.941860
0        1        A    100      0.023256             0.965116
9       10        C    100      0.023256             0.988372
5        6        B     50      0.011628             1.000000

But the catch is that I want to apply this process not to the whole dataframe, but within each category. I tried a custom function:

def acc_pct(s):
  s = s.sort_values(ascending = False)
  s = s/s.sum()
  s = s.cumsum()
  return s.sort_index()

df.groupby('category').agg({'sales':acc_pct})

But it didn't work. It throws a ValueError: Must produce aggregated value.

I know it has to be possible, because groupby.cumcount(), groupby.cumsum() and groupby.shift() work much like this. How do I do it?

Upvotes: 1

Views: 609

Answers (1)

Henry Ecker

Reputation: 35646

Try dividing sales by the per-category total from groupby transform('sum') to get pct_of_total, then take a groupby cumsum of the new column:

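# Sort once so the running total goes from most sold to least sold
df = df.sort_values('sales', ascending=False)
# Each row's share of its own category's total sales
df['pct_of_total'] = (
    df['sales'] / df.groupby('category')['sales'].transform('sum')
)
# Running total of those shares within each category
df['cumsum_pct_of_total'] = df.groupby('category')['pct_of_total'].cumsum()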

df:

   item_id category  sales  pct_of_total  cumsum_pct_of_total
3        4        C   1000      0.476190             0.476190
6        7        C   1000      0.476190             0.952381
8        9        A    700      0.411765             0.411765
7        8        A    600      0.352941             0.764706
2        3        A    300      0.176471             0.941176
4        5        B    300      0.600000             0.600000
1        2        B    150      0.300000             0.900000
0        1        A    100      0.058824             1.000000
9       10        C    100      0.047619             1.000000
5        6        B     50      0.100000             1.000000
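
If you would rather keep a custom function, one option is to pass it to groupby transform instead of agg. agg expects a single value per group, which is why it raises "Must produce aggregated value", while transform accepts a result with the same length as the group. A minimal sketch, reusing acc_pct from the question as-is (the target column name is just borrowed from the example above):

```python
import pandas as pd

df = pd.DataFrame({
    'item_id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'category': ['A', 'B', 'A', 'C', 'B', 'B', 'C', 'A', 'A', 'C'],
    'sales': [100, 150, 300, 1000, 300, 50, 1000, 600, 700, 100],
})

def acc_pct(s):
    # s holds the sales of a single category
    s = s.sort_values(ascending=False)  # most sold first
    s = s / s.sum()                     # share of the category total
    s = s.cumsum()                      # running share, largest items first
    return s.sort_index()               # back to the group's original row order

# transform returns one value per input row, aligned on the original index
df['cumsum_pct_of_total'] = df.groupby('category')['sales'].transform(acc_pct)
```

The result lines up with the original rows, so df does not need to be sorted first. Which of two tied items (items 4 and 7 here) is counted first is left to sort_values, so treat the order within a tie as unspecified.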

Upvotes: 1
