rds
rds

Reputation: 79

How to get relative percentage of a DF column based on groups and values of other DF columns

I'm trying to get the relative percentage of a data frame column based on the groupings of DF columns and a specific value of another DF column.

I'm using .value_counts() to get the counts for each value, and .value_counts(normalize=True) * 100 to get the relative percentages. The problem is, when I use .value_counts(normalize=True) while slicing the relative percentages are all 100.

import pandas as pd

df = pd.DataFrame({'state':['AK','AK','AK','AK','AK','AL','AL','AL','AL','AL'], 'county':['Cnty1','Cnty1','Cnty1','Cnty2','Cnty2','Cnty3','Cnty3','Cnty3','Cnty4','Cnty4'], 'reason':['move','move','vacate','move','vacate','vacate','move','vacate','move','move']})

I get one of the desired results with this:

StateDF = pd.DataFrame(df.groupby('state')['reason'].value_counts())
StateDF.rename(columns={'reason':'N'}, inplace=True)
StateDF['% of Total'] = df.groupby('state')['reason'].value_counts(normalize=True) * 100

Result:

              N  % of Total
state reason               
AK    move    3        60.0
      vacate  2        40.0
AL    move    3        60.0
      vacate  1        20.0
      vacte   1        20.0

However, when I try to do a similar thing, but group by county and select for only reason == 'move', the counts are correct but the relative percentage is 100 for all.

CountyDF = pd.DataFrame(df[df['reason']=='move'].groupby(['state','county'])['reason'].value_counts())
CountyDF.rename(columns={'disreasn':'N'}, inplace=True)
CountyDF['% of Total'] = pd.DataFrame(df[df['reason']=='move'].groupby(['state','county'])['reason'].value_counts(normalize=True) * 100)

Result:

                     reason  % of Total
state county reason                    
AK    Cnty1  move         2       100.0
      Cnty2  move         1       100.0
AL    Cnty3  move         1       100.0
      Cnty4  move         2       100.0

Upvotes: 0

Views: 611

Answers (1)

Shenan
Shenan

Reputation: 338

It depends on which percentage are you looking at or what is the denominator. In your first example, the denominator is the data with "different reasons" within each state, so it works. In your second example, the denominator is data with 'all reasons' in each county and state, but given here 'different reasons' == 'move', you get the same population in the nominator. If you want the denominator to be data with 'different counties' within each state, you can try

CountyDF['% of Total'] = pd.DataFrame(df[df['reason']=='move'].groupby(['state'])['county'].value_counts(normalize=True) * 100)

which will get you

                     reason  % of Total
state county reason                    
AK    Cnty1  move         2   66.666667
      Cnty2  move         1   33.333333
AL    Cnty3  move         1   33.333333
      Cnty4  move         2   66.666667

Upvotes: 1

Related Questions