Reputation: 79
I'm trying to get the relative percentage of a data frame column based on the groupings of DF columns and a specific value of another DF column.
I'm using .value_counts() to get the counts for each value, and .value_counts(normalize=True) * 100
to get the relative percentages. The problem is, when I use .value_counts(normalize=True)
while slicing the relative percentages are all 100.
import pandas as pd
df = pd.DataFrame({'state':['AK','AK','AK','AK','AK','AL','AL','AL','AL','AL'], 'county':['Cnty1','Cnty1','Cnty1','Cnty2','Cnty2','Cnty3','Cnty3','Cnty3','Cnty4','Cnty4'], 'reason':['move','move','vacate','move','vacate','vacate','move','vacate','move','move']})
I get one of the desired results with this:
StateDF = pd.DataFrame(df.groupby('state')['reason'].value_counts())
StateDF.rename(columns={'reason':'N'}, inplace=True)
StateDF['% of Total'] = df.groupby('state')['reason'].value_counts(normalize=True) * 100
Result:
N % of Total
state reason
AK move 3 60.0
vacate 2 40.0
AL move 3 60.0
vacate 1 20.0
vacte 1 20.0
However, when I try to do a similar thing, but group by county and select for only reason == 'move'
, the counts are correct but the relative percentage is 100 for all.
CountyDF = pd.DataFrame(df[df['reason']=='move'].groupby(['state','county'])['reason'].value_counts())
CountyDF.rename(columns={'disreasn':'N'}, inplace=True)
CountyDF['% of Total'] = pd.DataFrame(df[df['reason']=='move'].groupby(['state','county'])['reason'].value_counts(normalize=True) * 100)
Result:
reason % of Total
state county reason
AK Cnty1 move 2 100.0
Cnty2 move 1 100.0
AL Cnty3 move 1 100.0
Cnty4 move 2 100.0
Upvotes: 0
Views: 611
Reputation: 338
It depends on which percentage are you looking at or what is the denominator. In your first example, the denominator is the data with "different reasons" within each state, so it works. In your second example, the denominator is data with 'all reasons' in each county and state, but given here 'different reasons' == 'move', you get the same population in the nominator. If you want the denominator to be data with 'different counties' within each state, you can try
CountyDF['% of Total'] = pd.DataFrame(df[df['reason']=='move'].groupby(['state'])['county'].value_counts(normalize=True) * 100)
which will get you
reason % of Total
state county reason
AK Cnty1 move 2 66.666667
Cnty2 move 1 33.333333
AL Cnty3 move 1 33.333333
Cnty4 move 2 66.666667
Upvotes: 1