Multi index groupby count including NaN values in pandas and, calculate percentage by single groupby

Question

I have a df as shown below.

df:

Country       Player
Arg           Messi
Bra           Neymar
Arg           NaN
Arg           Messi
Arg           Aguero
Arg           Messi
Bra           Ronaldo
Spain         Xavi
Spain         NaN
Spain         NaN
Bra           Rivaldo
Spain         Iniesta
Bra           NaN
Spain         Xavi

Where NaN stands for information not available. From the above df, I would like to perform multiple groupby counts as shown below.

Expected output:

Country    Player    Counts    Percentage_of_country
Arg        NaN       1         20
Arg        Messi     3         60
Arg        Aguero    1         20
Bra        Neymar    1         25
Bra        NaN       1         25
Bra        Ronaldo   1         25
Bra        Rivaldo   1         25
Spain      NaN       2         40
Spain      Xavi      2         40 
Spain      Iniesta   1         20

I tried below code:

df2 = df.groupby(['Country', 'Player']).size().reset_index(name='counts')
df2['prcntg'] = df2['counts']/df2.groupby('Country')['counts'].transform('sum')
df2

SeaBean · Accepted Answer

Another way to do it, really producing all results in a single groupby is as follows:

Define a helper function to calculate the percentage:

with dropna=False to keep the NaN values:

f = lambda x: x.size / df.groupby('Country', dropna=False).size()[x.iloc[0]] * 100

The first size function returns counts under the group of ['Country', 'Player'], while the second size function, which is grouped under Country only, returns the counts under the bigger group.

Then, make use of the named aggregation of DataFrameGroupBy.aggregate():

(df.groupby(['Country', 'Player'], dropna=False)
   .agg(counts=('Player', 'size'), 
        prcntg=('Country', f))
)

Result:

                 counts  prcntg
Country Player                 
Arg     Aguero        1    20.0
        Messi         3    60.0
        NaN           1    20.0
Bra     Neymar        1    25.0
        Rivaldo       1    25.0
        Ronaldo       1    25.0
        NaN           1    25.0
Spain   Iniesta       1    20.0
        Xavi          2    40.0
        NaN           2    40.0

Edit

If you got error TypeError: groupby() got an unexpected keyword argument 'dropna', probably your Pandas version is older than version 1.1.0. This dropna parameter, which allows you keeping the NaN counts, is supported since this version. Probably you should consider upgrading your Pandas for richer sets of Pandas features.

If you cannot upgrade at the moment, a workaround solution is to replace NaN in the Player column by some other text, eg. string '_NaN' or some special word before the grouping. You can restore its values after the grouping if you need to. Sample codes below:

import numpy as np

df['Player'] = df['Player'].fillna('_NaN')      # Set `NaN` values to string `_NaN`

# Main processing with all results produced in a single `groupby`:
f = lambda x: x.size / df.groupby('Country').size()[x.iloc[0]] * 100
df_out = (df.groupby(['Country', 'Player'], as_index=False)
            .agg(counts=('Player', 'size'), 
                 prcntg=('Country', f))
         )

df_out['Player'] = df_out['Player'].replace('_NaN', np.nan)     # restore `NaN` values

Result:

print(df_out)

  Country   Player  counts  prcntg
0     Arg   Aguero       1    20.0
1     Arg    Messi       3    60.0
2     Arg      NaN       1    20.0
3     Bra   Neymar       1    25.0
4     Bra  Rivaldo       1    25.0
5     Bra  Ronaldo       1    25.0
6     Bra      NaN       1    25.0
7   Spain  Iniesta       1    20.0
8   Spain     Xavi       2    40.0
9   Spain      NaN       2    40.0

Multi index groupby count including NaN values in pandas and, calculate percentage by single groupby

Answers (2)

Edit

Related Questions