Reputation: 323
python 3.7.10 pandas 1.1.5
Imagine we have Dataframe with two columns containing categories and third column with numbers. Task is to group by first category and then subgroup by second category and calculate totals and shares.
import pandas as pd
df = pd.DataFrame({
'fruit': ['orange', 'orange', 'orange', 'banana', 'banana', 'banana'],
'origin': ['USA', 'Canada', 'USA', 'Canada', 'USA', 'Canada'],
'weight': [1, 2, 3, 4, 5, 6]
})
df
fruit | origin | weight | |
---|---|---|---|
0 | orange | USA | 1 |
1 | orange | Canada | 2 |
2 | orange | USA | 3 |
3 | banana | Canada | 4 |
4 | banana | USA | 5 |
5 | banana | Canada | 6 |
(df
.groupby('fruit')
.apply(lambda x: (x
.groupby('origin')
.agg({'weight': sum})
.assign(share=lambda x: x.weight / x.weight.sum()))
)
)
fruit | origin | weight | share |
---|---|---|---|
banana | Canada | 10 | 0.666667 |
USA | 5 | 0.333333 | |
orange | Canada | 2 | 0.333333 |
USA | 4 | 0.666667 |
Is there a more pythonic / pandish / cleaner way to achieve the same result. For example, I can't rename weight on the fly in case it's not sum but rather count and I want column name to reflect this.
In R it looks to me much cleaner.
library(dplyr)
df <- tibble(
fruit = c('orange', 'orange', 'orange', 'banana', 'banana', 'banana'),
origin = c('USA', 'Canada', 'USA', 'Canada', 'USA', 'Canada'),
weight = c(1, 2, 3, 4, 5, 6)
)
df %>%
group_by(fruit, origin) %>%
summarise(total = sum(weight)) %>%
mutate(share = total / sum(total))
I believe there is some cleaner way to do it in python.
Upvotes: 4
Views: 203
Reputation: 15872
A direct translation from your r
code would need another groupby
:
>>> ( df.groupby(['fruit', 'origin'])
.sum().assign(
share=lambda x: x.groupby('fruit').transform(lambda x: x / x.sum())
)
)
weight share
fruit origin
banana Canada 10 0.666667
USA 5 0.333333
orange Canada 2 0.333333
USA 4 0.666667
Or,
>>> ( df.groupby(['fruit', 'origin'])
.sum().assign(share=lambda x: x / x.groupby('fruit').transform(sum))
)
weight share
fruit origin
banana Canada 10 0.666667
USA 5 0.333333
orange Canada 2 0.333333
USA 4 0.666667
Or, probably the most readable:
>>> ( df.groupby(['fruit', 'origin']).sum()
.assign(share=lambda x: x.div(df.groupby('fruit').sum()))
)
weight share
fruit origin
banana Canada 10 0.666667
USA 5 0.333333
orange Canada 2 0.333333
USA 4 0.666667
Even better with rdiv
, and finally truly one-liner :) :
>>> df.groupby(['fruit', 'origin']).sum().assign(share=df.groupby('fruit').sum().rdiv)
weight share
fruit origin
banana Canada 10 0.666667
USA 5 0.333333
orange Canada 2 0.333333
USA 4 0.666667
Something without groupby, using pd.melt
and pd.crosstab
:
>>> df2 = df.melt(['fruit', 'origin'], var_name='stats')
>>> pd.crosstab(
index=[df2['fruit'], df2['origin']],
columns=df2['stats'],
values=df2['value'],
aggfunc=sum
).assign(share=lambda x:x/x.sum(level=0))
stats weight share
fruit origin
banana Canada 10 0.666667
USA 5 0.333333
orange Canada 2 0.333333
USA 4 0.666667
Upvotes: 4
Reputation: 20669
You can use .set_index
then use .div
here.
Out = df.groupby(["fruit", "origin"]).sum()
Out = Out.assign(share=Out.div(df.set_index(["fruit", "origin"]).sum(level=0)))
weight share
fruit origin
banana Canada 10 0.666667
USA 5 0.333333
orange Canada 2 0.333333
USA 4 0.666667
Upvotes: 2
Reputation: 13417
In your R code you transformed the sum(weight)
to rename it as "total"
which you can do by passing in keyword arguments into a groupby(...).agg(new_name=("column_name", aggfunc)
. You can also gain some cleanliness by writing a helper function to perform the normalization with.
def normalize(x):
return x / x.sum()
out = (df.groupby(["fruit", "origin"])
.agg(total=("weight", "sum"))
.assign(
share=lambda df: df.groupby("fruit").transform(normalize)
))
print(out)
total share
fruit origin
banana Canada 10 0.666667
USA 5 0.333333
orange Canada 2 0.333333
USA 4 0.666667
Upvotes: 1
Reputation: 153500
This isn't as "clean" as R, but it can be done in one-liner:
df.groupby(['fruit', 'origin'])['weight'].sum().reset_index()\
.pipe(lambda x: x.assign(share=x['weight'] /
x.groupby('fruit')['weight'].transform('sum')))
Output:
fruit origin weight share
0 banana Canada 10 0.666667
1 banana USA 5 0.333333
2 orange Canada 2 0.333333
3 orange USA 4 0.666667
Upvotes: 3
Reputation: 34086
You can have two separate groupby
statements to make it cleaner:
In [101]: x = df.groupby(['fruit', 'origin']).sum().reset_index()
In [104]: x['share'] = x.groupby('fruit')['weight'].apply(lambda i: i/i.sum())
In [105]: x
Out[105]:
fruit origin weight share
0 banana Canada 10 0.666667
1 banana USA 5 0.333333
2 orange Canada 2 0.333333
3 orange USA 4 0.666667
OR, as per @Manakin's comment, avoiding apply:
In [101]: x = df.groupby(['fruit', 'origin']).sum().reset_index()
In [109]: x['share'] = x['weight'].div(x.groupby('fruit')['weight'].transform('sum'))
In [110]: x
Out[110]:
fruit origin weight share
0 banana Canada 10 0.666667
1 banana USA 5 0.333333
2 orange Canada 2 0.333333
3 orange USA 4 0.666667
Upvotes: 4