Vincent_Adultman
Vincent_Adultman

Reputation: 108

Using groupby -> transform(func) on a pandas dataframe when the function is performed on multiple columns, prioritizing speed

So let's say I have a dataframe:

df= pd.DataFrame({'grouping_col':['B','B','S','S','S','B'],'column1':[10,20,30,40,50,60],'column2':[5,6,7,8,9,10],'column3':[11,12,13,14,15,16],'column4':[16,17,18,19,20,21]})

df.set_index('grouping_col',inplace= True)

So let's say I have a function.

def func(x,col1,col2,col3,col4):
    return sum(x[col1]**2 * x[col2] * x[col3] * x[col4]) / sum(x[col1]**2 *x[col3]**2 * x[col4]**2)

I am trying to use a group by on a dataframe and then apply the function. What works is:

new_col= df.groupby('grouping_col').apply(func,'column1','column2','column3','column4').rename('new_col_name')
df = df.join(new_col,how= 'left')
print(df)

              column1  column2  column3  column4  new_col_name
grouping_col                                                  
B                  10        5       11       16      0.029738
B                  20        6       12       17      0.029738
B                  60       10       16       21      0.029738
S                  30        7       13       18      0.030011
S                  40        8       14       19      0.030011
S                  50        9       15       20      0.030011

This works fine, but I need to apply this function thousands of times on a much bigger/more complex dataframe and the joins take an unreasonably long amount of time, so I'm trying to come up with a quicker methodology.

So far I've tried:

df['new_col_name'] = df.groupby('grouping_col').transform(lambda x: func(x,'column1','column2','column3','column4'))

result: KeyError: ('column1', 'occurred at index column1')

So I guess it doesn't work because transform operate on each single column individually. Does anyone have any better ideas? Thanks in advance!

Upvotes: 1

Views: 102

Answers (1)

Shubham Sharma
Shubham Sharma

Reputation: 71689

One possible idea is to calculate the numerator n and denominator d of the mathematical operation before using groupby, then use Series.groupby to group the numerator and denominator on index of the dataframe finally use Groupby.transform to transform numerator and denominator using sum.

n = df['column1']**2 * df['column2'] * df['column3'] * df['column4']
d = df['column1']**2 * df['column3']**2 * df['column4']**2

df['new_col_name'] = (
    n.groupby(df.index).transform('sum') /
    d.groupby(df.index).transform('sum')
)

OR,

df['n'] = df['column1']**2 * df['column2'] * df['column3'] * df['column4']
df['d'] = df['column1']**2 * df['column3']**2 * df['column4']**2

df1 = df.groupby(level=0)[['n', 'd']].transform('sum')
df = df.assign(new_col_name=df1['n'].div(df1['d'])).drop(['n', 'd'], 1)

Result:

# print(df)
              column1  column2  column3  column4  new_col_name
grouping_col                                                  
B                  10        5       11       16      0.029738
B                  20        6       12       17      0.029738
S                  30        7       13       18      0.030011
S                  40        8       14       19      0.030011
S                  50        9       15       20      0.030011
B                  60       10       16       21      0.029738

Upvotes: 2

Related Questions