Reputation: 108
So let's say I have a dataframe:
df= pd.DataFrame({'grouping_col':['B','B','S','S','S','B'],'column1':[10,20,30,40,50,60],'column2':[5,6,7,8,9,10],'column3':[11,12,13,14,15,16],'column4':[16,17,18,19,20,21]})
df.set_index('grouping_col',inplace= True)
So let's say I have a function.
def func(x,col1,col2,col3,col4):
return sum(x[col1]**2 * x[col2] * x[col3] * x[col4]) / sum(x[col1]**2 *x[col3]**2 * x[col4]**2)
I am trying to use a group by on a dataframe and then apply the function. What works is:
new_col= df.groupby('grouping_col').apply(func,'column1','column2','column3','column4').rename('new_col_name')
df = df.join(new_col,how= 'left')
print(df)
column1 column2 column3 column4 new_col_name
grouping_col
B 10 5 11 16 0.029738
B 20 6 12 17 0.029738
B 60 10 16 21 0.029738
S 30 7 13 18 0.030011
S 40 8 14 19 0.030011
S 50 9 15 20 0.030011
This works fine, but I need to apply this function thousands of times on a much bigger/more complex dataframe and the joins take an unreasonably long amount of time, so I'm trying to come up with a quicker methodology.
So far I've tried:
df['new_col_name'] = df.groupby('grouping_col').transform(lambda x: func(x,'column1','column2','column3','column4'))
result: KeyError: ('column1', 'occurred at index column1')
So I guess it doesn't work because transform operate on each single column individually. Does anyone have any better ideas? Thanks in advance!
Upvotes: 1
Views: 102
Reputation: 71689
One possible idea is to calculate the numerator n
and denominator d
of the mathematical operation before using groupby, then use Series.groupby
to group the numerator and denominator on index of the dataframe finally use Groupby.transform
to transform numerator and denominator using sum
.
n = df['column1']**2 * df['column2'] * df['column3'] * df['column4']
d = df['column1']**2 * df['column3']**2 * df['column4']**2
df['new_col_name'] = (
n.groupby(df.index).transform('sum') /
d.groupby(df.index).transform('sum')
)
OR,
df['n'] = df['column1']**2 * df['column2'] * df['column3'] * df['column4']
df['d'] = df['column1']**2 * df['column3']**2 * df['column4']**2
df1 = df.groupby(level=0)[['n', 'd']].transform('sum')
df = df.assign(new_col_name=df1['n'].div(df1['d'])).drop(['n', 'd'], 1)
Result:
# print(df)
column1 column2 column3 column4 new_col_name
grouping_col
B 10 5 11 16 0.029738
B 20 6 12 17 0.029738
S 30 7 13 18 0.030011
S 40 8 14 19 0.030011
S 50 9 15 20 0.030011
B 60 10 16 21 0.029738
Upvotes: 2