Is it possible to access column names within pandas groupby transform?

Question

My ideal question would be "How to access columns in pandas DataFrameGroupBy.transform?", but after performing some tests (herein showed), I wonder if that's even possible.

I would like to access column names just like I can do with apply, but using transform instead.

For example, given this sample data:

import numpy as np
import pandas as pd

np.random.seed(123)
numeric_data = np.random.rand(9, 3)
cat_data = [f'grp_{i}' for i in range(1,4)] * 3
df = pd.DataFrame(numeric_data, columns=list('ABC')).assign(D = cat_data)
print(df)

          A         B         C      D
0  0.696469  0.286139  0.226851  grp_1
1  0.551315  0.719469  0.423106  grp_2
2  0.980764  0.684830  0.480932  grp_3
3  0.392118  0.343178  0.729050  grp_1
4  0.438572  0.059678  0.398044  grp_2
5  0.737995  0.182492  0.175452  grp_3
6  0.531551  0.531828  0.634401  grp_1
7  0.849432  0.724455  0.611024  grp_2
8  0.722443  0.322959  0.361789  grp_3

How can I use transform to subtract B from A, then multiply by C? Is it possible?
I know that with apply I can easily achieve that by using lambda or passing a user-defined function, like this:

def customFunc(grp):
    return (grp['A'] - grp['B']) * grp['C']

df.groupby('D').apply(customFunc)

D       
grp_1  0    0.093084
       3    0.035679
       6   -0.000175
grp_2  1   -0.071147
       4    0.150817
       7    0.076364
grp_3  2    0.142324
       5    0.097464
       8    0.144529
dtype: float64

The output values, though, it's unsorted (as you can see in the internal index) so that I can't just place this output as it is in a new column. An option would be sorting the dataframe before-hand using apply, but to be honest I'm not fully confident it would work as expected for large data with more complex groups in groupby. I would feel more comfortable using transform, otherwise, I think it's more reliable to merge the result back to the df by index (supposing we have a unique index).

If I try to use the same function with transform:

df.groupby('D').transform(customFunc)

then I get an error: KeyError: 'A'.

To inspect what's going on under the hood when using groupby.apply and groupby.transform, I did the following:

# Select the target-group
grp = df.groupby('D')

grp.apply(lambda x: type(x))

D
grp_1    (, 3)
grp_2    (, 3)
grp_3    (, 3)
dtype: object

grp.transform(lambda x: type(x))
                                     A                                    B                                    C  
0        
1        
2        
3        
4        
5        
6        
7        
8

As you can see, apply provide us sub-dataframes as groups, while I don't know exactly what transform does provide (it's the first time I've faced a property). I also have done further tests:

# Another trials
grp.transform(lambda x: x.shape) # ValueError
grp.transform(lambda x: x['A']) # KeyError
grp.transform(lambda x: x.loc[0]) # KeyError
grp.transform(lambda x: x.iloc[0]) # works (every value get the first value; similar to 'first')

It seems that with .iloc[] I can access the values for each column independently, but I still couldn't figure how to access the columns within transform (if it's even possible).

So, my final questions:

Is it possible to access column names in groupby.transform to perform calculation across columns?
If not, what's the best (reliable) way to place the output from apply back to the dataframe?

jezrael · Accepted Answer

Is it possible to access column names in groupby.transform to perform calculation across columns?

Not possible, .groupby.transform working with each column separately, so cannot 'see' another columns like groupby.apply.

You can see it if use print:

print (df.groupby('D').transform(lambda x: print(x)))
2    0.980764
5    0.737995
8    0.722443
Name: A, dtype: float64
2    0.684830
5    0.182492
8    0.322959
Name: B, dtype: float64
2    0.480932
5    0.175452
8    0.361789
Name: C, dtype: float64

If not, what's the best (reliable) way to place the output from apply back to the dataframe?

If function not aggregate values:

def customFunc(grp):
    return (grp['A'] - grp['B']) * grp['C']

df['new'] = df.groupby('D').apply(customFunc).rename('new').reset_index(level=0, drop=True)

print (df)
          A         B         C      D       new
0  0.696469  0.286139  0.226851  grp_1  0.093084
1  0.551315  0.719469  0.423106  grp_2 -0.071147
2  0.980764  0.684830  0.480932  grp_3  0.142324
3  0.392118  0.343178  0.729050  grp_1  0.035679
4  0.438572  0.059678  0.398044  grp_2  0.150817
5  0.737995  0.182492  0.175452  grp_3  0.097464
6  0.531551  0.531828  0.634401  grp_1 -0.000175
7  0.849432  0.724455  0.611024  grp_2  0.076364
8  0.722443  0.322959  0.361789  grp_3  0.144529

so working same like:

df['new'] = (df['A'] - df['B']) * df['C']

print (df)
         A         B         C      D       new
0  0.696469  0.286139  0.226851  grp_1  0.093084
1  0.551315  0.719469  0.423106  grp_2 -0.071147
2  0.980764  0.684830  0.480932  grp_3  0.142324
3  0.392118  0.343178  0.729050  grp_1  0.035679
4  0.438572  0.059678  0.398044  grp_2  0.150817
5  0.737995  0.182492  0.175452  grp_3  0.097464
6  0.531551  0.531828  0.634401  grp_1 -0.000175
7  0.849432  0.724455  0.611024  grp_2  0.076364
8  0.722443  0.322959  0.361789  grp_3  0.144529

If function aggregate values use DataFrame.join or Series.map if one column is used for grouping:

def customFunc(grp):
    return ((grp['A'] - grp['B']) * grp['C']).mean()

df = df.join(df.groupby('D').apply(customFunc).rename('new'), on='D')

def customFunc(grp):
    return ((grp['A'] - grp['B']) * grp['C']).mean()

df['new'] = df['D'].map(df.groupby('D').apply(customFunc))


print (df)
       A         B         C      D       new
0  0.696469  0.286139  0.226851  grp_1  0.042863
1  0.551315  0.719469  0.423106  grp_2  0.052011
2  0.980764  0.684830  0.480932  grp_3  0.128106
3  0.392118  0.343178  0.729050  grp_1  0.042863
4  0.438572  0.059678  0.398044  grp_2  0.052011
5  0.737995  0.182492  0.175452  grp_3  0.128106
6  0.531551  0.531828  0.634401  grp_1  0.042863
7  0.849432  0.724455  0.611024  grp_2  0.052011
8  0.722443  0.322959  0.361789  grp_3  0.128106

Or change function:

def customFunc(grp):
    grp['new'] = ((grp['A'] - grp['B']) * grp['C']).mean()
    return grp
    
df = df.groupby('D').apply(customFunc)

print (df)
          A         B         C      D       new
0  0.696469  0.286139  0.226851  grp_1  0.042863
1  0.551315  0.719469  0.423106  grp_2  0.052011
2  0.980764  0.684830  0.480932  grp_3  0.128106
3  0.392118  0.343178  0.729050  grp_1  0.042863
4  0.438572  0.059678  0.398044  grp_2  0.052011
5  0.737995  0.182492  0.175452  grp_3  0.128106
6  0.531551  0.531828  0.634401  grp_1  0.042863
7  0.849432  0.724455  0.611024  grp_2  0.052011
8  0.722443  0.322959  0.361789  grp_3  0.128106

Is it possible to access column names within pandas groupby transform?

Answers (2)

Related Questions