Reputation: 2871
I have a df as shown below.
df:
Country Player
Arg Messi
Bra Neymar
Arg NaN
Arg Messi
Arg Aguero
Arg Messi
Bra Ronaldo
Spain Xavi
Spain NaN
Spain NaN
Bra Rivaldo
Spain Iniesta
Bra NaN
Spain Xavi
Where NaN stands for information not available. From the above df, I would like to perform multiple groupby counts as shown below.
Expected output:
Country Player Counts Percentage_of_country
Arg NaN 1 20
Arg Messi 3 60
Arg Aguero 1 20
Bra Neymar 1 25
Bra NaN 1 25
Bra Ronaldo 1 25
Bra Rivaldo 1 25
Spain NaN 2 40
Spain Xavi 2 40
Spain Iniesta 1 20
I tried below code:
df2 = df.groupby(['Country', 'Player']).size().reset_index(name='counts')
df2['prcntg'] = df2['counts']/df2.groupby('Country')['counts'].transform('sum')
df2
Upvotes: 4
Views: 928
Reputation: 23217
Another way to do it, really producing all results in a single groupby
is as follows:
Define a helper function to calculate the percentage:
with dropna=False
to keep the NaN
values:
f = lambda x: x.size / df.groupby('Country', dropna=False).size()[x.iloc[0]] * 100
The first size
function returns counts under the group of ['Country', 'Player']
, while the second size
function, which is grouped under Country
only, returns the counts under the bigger group.
Then, make use of the named aggregation
of DataFrameGroupBy.aggregate()
:
(df.groupby(['Country', 'Player'], dropna=False)
.agg(counts=('Player', 'size'),
prcntg=('Country', f))
)
Result:
counts prcntg
Country Player
Arg Aguero 1 20.0
Messi 3 60.0
NaN 1 20.0
Bra Neymar 1 25.0
Rivaldo 1 25.0
Ronaldo 1 25.0
NaN 1 25.0
Spain Iniesta 1 20.0
Xavi 2 40.0
NaN 2 40.0
If you got error TypeError: groupby() got an unexpected keyword argument 'dropna'
, probably your Pandas version is older than version 1.1.0. This dropna parameter, which allows you keeping the NaN counts, is supported since this version. Probably you should consider upgrading your Pandas for richer sets of Pandas features.
If you cannot upgrade at the moment, a workaround solution is to replace NaN in the Player column by some other text, eg. string '_NaN' or some special word before the grouping. You can restore its values after the grouping if you need to. Sample codes below:
import numpy as np
df['Player'] = df['Player'].fillna('_NaN') # Set `NaN` values to string `_NaN`
# Main processing with all results produced in a single `groupby`:
f = lambda x: x.size / df.groupby('Country').size()[x.iloc[0]] * 100
df_out = (df.groupby(['Country', 'Player'], as_index=False)
.agg(counts=('Player', 'size'),
prcntg=('Country', f))
)
df_out['Player'] = df_out['Player'].replace('_NaN', np.nan) # restore `NaN` values
Result:
print(df_out)
Country Player counts prcntg
0 Arg Aguero 1 20.0
1 Arg Messi 3 60.0
2 Arg NaN 1 20.0
3 Bra Neymar 1 25.0
4 Bra Rivaldo 1 25.0
5 Bra Ronaldo 1 25.0
6 Bra NaN 1 25.0
7 Spain Iniesta 1 20.0
8 Spain Xavi 2 40.0
9 Spain NaN 2 40.0
Upvotes: 3
Reputation: 18426
First group the dataframe by Country
, and Player
, then call size
for the count, and call to_frame
passing the column name to create dataframe out of it. You also need to pass dropna=True
since you want to include NaN
.
After that, you can group the count by level=0
then call tranform
to get the sum for the levels, and divide the count by this value. You can call reset_index
at last if needed.
count=df.groupby(['Country', 'Player'], dropna=False).size().to_frame('Counts')
count['Percentage_of_country']=100*count/count.groupby(level=0).transform('sum')
OUTPUT:
Counts Percentage_of_country
Country Player
Arg Aguero 1 20.0
Messi 3 60.0
NaN 1 20.0
Bra Neymar 1 25.0
Rivaldo 1 25.0
Ronaldo 1 25.0
NaN 1 25.0
Spain Iniesta 1 20.0
Xavi 2 40.0
NaN 2 40.0
The dropna
parameter was introduced in pandas version 1.1.0, so if you are using a version older than that, you can first try to replace the NaN
value to something else, then revert back to NaN
after performing the required operation.
df['Player'] = df['Player'].fillna('#!Missing!#') #replace NaN by #!Missing!#'
count=df.groupby(['Country', 'Player']).size().to_frame('Counts')
count['Percentage_of_country']=100*count/count.groupby(level=0).transform('sum')
count.reset_index(inplace=True)
count['Player'] = count['Player'].replace({'#!Missing!#':float('nan')})
Upvotes: 2