Cornel Westside
Cornel Westside

Reputation: 117

Using groupby and to create a column with percentage frequency

Working in python, in a Jupyter notebook. I am given this dataframe

congress    chamber state   party
80           house   TX       D
80           house   TX       D
80           house   NJ       D
80           house   TX       R
80          senate   KY       R

of every congressperson since the 80th congressional term, with a bunch of information. I've narrowed it down to what's needed for this question. I want to alter the dataframe so that I have a single row for every unique combination of congressional term, chamber, state, and party affiliation, Then a new column with the number of rows that are of the associated party divided by the number of rows where everything else besides that is the same. For example, this

congress    chamber state   party   perc
80           house  TX       D      0.66
80           house  NJ       D      1
80           house  TX       R      0.33
80           senate KY       R      1

is what I'd want my result to look like. The perc column is the percentage of, for example, democrats elected to congress in TX in the 80th congressional election.

I've tried a few different methods I've found on here, but most of them divide the number of rows by the number of rows in the entire dataframe, rather than by just the rows that meet the 3 given criteria. Here's the latest thing I've tried:

term=80
newdf = pd.crosstab(index=df['party'], columns=df['state']).stack()/len(df[df['congress']==term])

I define term because I'll only care about one term at a time for each dataframe.

A method I tried using groupby involved the following:

newdf = df.groupby(['congress', 'chamber','state']).agg({'party': 'count'})
state_pcts = newdf.groupby('party').apply(lambda x:
                                                 100 * x / float(x.sum()))

And it does group by term, chamber, state, but it returns a number that doesn't mean anything to me, when I check what the actual results should be.

Upvotes: 0

Views: 80

Answers (1)

YOLO
YOLO

Reputation: 21709

Basically, you can do the following using value_counts for each group:

def func(f):
    return f['party'].value_counts(normalize=True)

df = (df
     .groupby(['congress','chamber','state'])
     .apply(func)
     .reset_index()
     .rename(columns={'party':'perc','level_3':'party'}))

print(df)

   congress chamber state party      perc
0        80   house    NJ     D  1.000000
1        80   house    TX     D  0.666667
2        80   house    TX     R  0.333333
3        80  senate    KY     R  1.000000

Upvotes: 1

Related Questions