Danish
Danish

Reputation: 2871

Multi index groupby count including NaN values in pandas and, calculate percentage by single groupby

I have a df as shown below.

df:

Country       Player
Arg           Messi
Bra           Neymar
Arg           NaN
Arg           Messi
Arg           Aguero
Arg           Messi
Bra           Ronaldo
Spain         Xavi
Spain         NaN
Spain         NaN
Bra           Rivaldo
Spain         Iniesta
Bra           NaN
Spain         Xavi

Where NaN stands for information not available. From the above df, I would like to perform multiple groupby counts as shown below.

Expected output:

Country    Player    Counts    Percentage_of_country
Arg        NaN       1         20
Arg        Messi     3         60
Arg        Aguero    1         20
Bra        Neymar    1         25
Bra        NaN       1         25
Bra        Ronaldo   1         25
Bra        Rivaldo   1         25
Spain      NaN       2         40
Spain      Xavi      2         40 
Spain      Iniesta   1         20

I tried below code:

df2 = df.groupby(['Country', 'Player']).size().reset_index(name='counts')
df2['prcntg'] = df2['counts']/df2.groupby('Country')['counts'].transform('sum')
df2

Upvotes: 4

Views: 928

Answers (2)

SeaBean
SeaBean

Reputation: 23217

Another way to do it, really producing all results in a single groupby is as follows:

Define a helper function to calculate the percentage:

with dropna=False to keep the NaN values:

f = lambda x: x.size / df.groupby('Country', dropna=False).size()[x.iloc[0]] * 100

The first size function returns counts under the group of ['Country', 'Player'], while the second size function, which is grouped under Country only, returns the counts under the bigger group.

Then, make use of the named aggregation of DataFrameGroupBy.aggregate():

(df.groupby(['Country', 'Player'], dropna=False)
   .agg(counts=('Player', 'size'), 
        prcntg=('Country', f))
)

Result:

                 counts  prcntg
Country Player                 
Arg     Aguero        1    20.0
        Messi         3    60.0
        NaN           1    20.0
Bra     Neymar        1    25.0
        Rivaldo       1    25.0
        Ronaldo       1    25.0
        NaN           1    25.0
Spain   Iniesta       1    20.0
        Xavi          2    40.0
        NaN           2    40.0

Edit

If you got error TypeError: groupby() got an unexpected keyword argument 'dropna', probably your Pandas version is older than version 1.1.0. This dropna parameter, which allows you keeping the NaN counts, is supported since this version. Probably you should consider upgrading your Pandas for richer sets of Pandas features.

If you cannot upgrade at the moment, a workaround solution is to replace NaN in the Player column by some other text, eg. string '_NaN' or some special word before the grouping. You can restore its values after the grouping if you need to. Sample codes below:

import numpy as np

df['Player'] = df['Player'].fillna('_NaN')      # Set `NaN` values to string `_NaN`

# Main processing with all results produced in a single `groupby`:
f = lambda x: x.size / df.groupby('Country').size()[x.iloc[0]] * 100
df_out = (df.groupby(['Country', 'Player'], as_index=False)
            .agg(counts=('Player', 'size'), 
                 prcntg=('Country', f))
         )

df_out['Player'] = df_out['Player'].replace('_NaN', np.nan)     # restore `NaN` values

Result:

print(df_out)

  Country   Player  counts  prcntg
0     Arg   Aguero       1    20.0
1     Arg    Messi       3    60.0
2     Arg      NaN       1    20.0
3     Bra   Neymar       1    25.0
4     Bra  Rivaldo       1    25.0
5     Bra  Ronaldo       1    25.0
6     Bra      NaN       1    25.0
7   Spain  Iniesta       1    20.0
8   Spain     Xavi       2    40.0
9   Spain      NaN       2    40.0

Upvotes: 3

ThePyGuy
ThePyGuy

Reputation: 18426

First group the dataframe by Country, and Player, then call size for the count, and call to_frame passing the column name to create dataframe out of it. You also need to pass dropna=True since you want to include NaN.

After that, you can group the count by level=0 then call tranform to get the sum for the levels, and divide the count by this value. You can call reset_index at last if needed.

count=df.groupby(['Country', 'Player'], dropna=False).size().to_frame('Counts')
count['Percentage_of_country']=100*count/count.groupby(level=0).transform('sum')

OUTPUT:

                 Counts  Percentage_of_country
Country Player                                
Arg     Aguero        1                   20.0
        Messi         3                   60.0
        NaN           1                   20.0
Bra     Neymar        1                   25.0
        Rivaldo       1                   25.0
        Ronaldo       1                   25.0
        NaN           1                   25.0
Spain   Iniesta       1                   20.0
        Xavi          2                   40.0
        NaN           2                   40.0

The dropna parameter was introduced in pandas version 1.1.0, so if you are using a version older than that, you can first try to replace the NaN value to something else, then revert back to NaN after performing the required operation.

df['Player'] = df['Player'].fillna('#!Missing!#')  #replace NaN by #!Missing!#'
count=df.groupby(['Country', 'Player']).size().to_frame('Counts')
count['Percentage_of_country']=100*count/count.groupby(level=0).transform('sum')
count.reset_index(inplace=True)
count['Player'] = count['Player'].replace({'#!Missing!#':float('nan')})

Upvotes: 2

Related Questions