Richard

Reputation: 61249

Manipulate A Group Column in Pandas

I have a data set with columns Dist, Class, and Count.

I want to group that data set by Dist and divide each group's Count values by the sum of the counts for that group, so that the counts within each group sum to one.

The following MWE demonstrates my approach so far. Is there a more compact, more idiomatic pandas way of writing this?

import pandas as pd
import numpy as np

a = np.random.randint(0,4,(10,3))
s = pd.DataFrame(a,columns=['Dist','Class','Count'])

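def manipcolumn(group):
    # Work on a copy so the group passed in by groupby is not mutated.
    group = group.copy()
    # Divide each count by the group's total so the group sums to one.
    group['Count'] = group['Count'] / group['Count'].sum()
    return group

s.groupby('Dist').apply(manipcolumn)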

Upvotes: 0

Views: 252

Answers (1)

Alex Riley

Reputation: 176730

An alternative way to get the normalised 'Count' column is to use groupby and transform to get the sum for each group, aligned row by row with the original DataFrame, and then divide the 'Count' column by the returned Series. You can assign the result back to your DataFrame:

s['Count'] = s['Count'] / s.groupby('Dist')['Count'].transform('sum')

This avoids the need for a bespoke Python function and the use of apply. Testing it for the small example DataFrame in your question showed that it was around 8 times faster.
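
As a minimal sketch of how you might check the result, assuming a small hand-made DataFrame (the values below are made up for illustration and are not from your question), the normalised counts in each 'Dist' group should add up to 1:

```python
import pandas as pd

# Hypothetical fixed data (not from the question) so the result is reproducible.
s = pd.DataFrame({
    'Dist':  [0, 0, 1, 1, 1],
    'Class': [1, 2, 1, 2, 3],
    'Count': [1, 3, 2, 2, 4],
})

# transform returns a Series with the same index as s, holding each row's group total.
group_sums = s.groupby('Dist')['Count'].transform('sum')

# assign builds a new DataFrame, leaving the original s untouched.
normalised = s.assign(Count=s['Count'] / group_sums)

# Sanity check: sum the normalised counts within each group.
check = normalised.groupby('Dist')['Count'].sum()

print(normalised)
print(check)
```

Here assign is used instead of overwriting s['Count'] only so that the original counts stay available for comparison; assigning in place works just as well.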

Upvotes: 2
