Roman Shevtsiv
Roman Shevtsiv

Reputation: 323

More effective / clean way to aggregate data

python 3.7.10 pandas 1.1.5

Imagine we have Dataframe with two columns containing categories and third column with numbers. Task is to group by first category and then subgroup by second category and calculate totals and shares.

import pandas as pd

df = pd.DataFrame({
    'fruit': ['orange', 'orange', 'orange', 'banana', 'banana', 'banana'],
    'origin': ['USA', 'Canada', 'USA', 'Canada', 'USA', 'Canada'],
    'weight': [1, 2, 3, 4, 5, 6]
})
df
fruit origin weight
0 orange USA 1
1 orange Canada 2
2 orange USA 3
3 banana Canada 4
4 banana USA 5
5 banana Canada 6
(df
 .groupby('fruit')
 .apply(lambda x: (x
                   .groupby('origin')
                   .agg({'weight': sum})
                   .assign(share=lambda x: x.weight / x.weight.sum()))
 )
)
fruit origin weight share
banana Canada 10 0.666667
USA 5 0.333333
orange Canada 2 0.333333
USA 4 0.666667

Is there a more pythonic / pandish / cleaner way to achieve the same result. For example, I can't rename weight on the fly in case it's not sum but rather count and I want column name to reflect this.

In R it looks to me much cleaner.

library(dplyr)

df <- tibble(
  fruit = c('orange', 'orange', 'orange', 'banana', 'banana', 'banana'),
  origin = c('USA', 'Canada', 'USA', 'Canada', 'USA', 'Canada'),
  weight = c(1, 2, 3, 4, 5, 6)
)

df %>%
  group_by(fruit, origin) %>%
  summarise(total = sum(weight)) %>%
  mutate(share = total / sum(total))

I believe there is some cleaner way to do it in python.

Upvotes: 4

Views: 203

Answers (5)

Sayandip Dutta
Sayandip Dutta

Reputation: 15872

A direct translation from your r code would need another groupby:

>>> ( df.groupby(['fruit', 'origin'])
        .sum().assign(
            share=lambda x: x.groupby('fruit').transform(lambda x: x / x.sum())
         )
     )
               weight     share
fruit  origin                  
banana Canada      10  0.666667
       USA          5  0.333333
orange Canada       2  0.333333
       USA          4  0.666667

Or,

>>> ( df.groupby(['fruit', 'origin'])
        .sum().assign(share=lambda x: x / x.groupby('fruit').transform(sum))
    )
 
               weight     share
fruit  origin                  
banana Canada      10  0.666667
       USA          5  0.333333
orange Canada       2  0.333333
       USA          4  0.666667

Or, probably the most readable:

>>> ( df.groupby(['fruit', 'origin']).sum()
        .assign(share=lambda x: x.div(df.groupby('fruit').sum()))
    )

               weight     share
fruit  origin                  
banana Canada      10  0.666667
       USA          5  0.333333
orange Canada       2  0.333333
       USA          4  0.666667

Even better with rdiv, and finally truly one-liner :) :

>>> df.groupby(['fruit', 'origin']).sum().assign(share=df.groupby('fruit').sum().rdiv)
               weight     share
fruit  origin                  
banana Canada      10  0.666667
       USA          5  0.333333
orange Canada       2  0.333333
       USA          4  0.666667

Something without groupby, using pd.melt and pd.crosstab:

>>> df2 = df.melt(['fruit', 'origin'], var_name='stats')
>>> pd.crosstab(
        index=[df2['fruit'], df2['origin']], 
        columns=df2['stats'], 
        values=df2['value'], 
        aggfunc=sum
    ).assign(share=lambda x:x/x.sum(level=0))

stats          weight     share
fruit  origin                  
banana Canada      10  0.666667
       USA          5  0.333333
orange Canada       2  0.333333
       USA          4  0.666667

Upvotes: 4

Ch3steR
Ch3steR

Reputation: 20669

You can use .set_index then use .div here.

Out = df.groupby(["fruit", "origin"]).sum()
Out = Out.assign(share=Out.div(df.set_index(["fruit", "origin"]).sum(level=0)))

               weight     share
fruit  origin                  
banana Canada      10  0.666667
       USA          5  0.333333
orange Canada       2  0.333333
       USA          4  0.666667

Upvotes: 2

Cameron Riddell
Cameron Riddell

Reputation: 13417

In your R code you transformed the sum(weight) to rename it as "total" which you can do by passing in keyword arguments into a groupby(...).agg(new_name=("column_name", aggfunc). You can also gain some cleanliness by writing a helper function to perform the normalization with.

def normalize(x):
    return x / x.sum()

out = (df.groupby(["fruit", "origin"])
         .agg(total=("weight", "sum"))
         .assign(
             share=lambda df: df.groupby("fruit").transform(normalize)
         ))

print(out)
               total     share
fruit  origin
banana Canada     10  0.666667
       USA         5  0.333333
orange Canada      2  0.333333
       USA         4  0.666667

Upvotes: 1

Scott Boston
Scott Boston

Reputation: 153500

This isn't as "clean" as R, but it can be done in one-liner:

df.groupby(['fruit', 'origin'])['weight'].sum().reset_index()\
  .pipe(lambda x: x.assign(share=x['weight'] / 
                                 x.groupby('fruit')['weight'].transform('sum')))

Output:

    fruit  origin  weight     share
0  banana  Canada      10  0.666667
1  banana     USA       5  0.333333
2  orange  Canada       2  0.333333
3  orange     USA       4  0.666667

Upvotes: 3

Mayank Porwal
Mayank Porwal

Reputation: 34086

You can have two separate groupby statements to make it cleaner:

In [101]: x = df.groupby(['fruit', 'origin']).sum().reset_index()
In [104]: x['share'] = x.groupby('fruit')['weight'].apply(lambda i: i/i.sum())

In [105]: x
Out[105]: 
    fruit  origin  weight     share
0  banana  Canada      10  0.666667
1  banana     USA       5  0.333333
2  orange  Canada       2  0.333333
3  orange     USA       4  0.666667

OR, as per @Manakin's comment, avoiding apply:

In [101]: x = df.groupby(['fruit', 'origin']).sum().reset_index()
In [109]: x['share'] = x['weight'].div(x.groupby('fruit')['weight'].transform('sum'))

In [110]: x
Out[110]: 
    fruit  origin  weight     share
0  banana  Canada      10  0.666667
1  banana     USA       5  0.333333
2  orange  Canada       2  0.333333
3  orange     USA       4  0.666667

Upvotes: 4

Related Questions