Is it possible to access column names within pandas groupby transform?

My ideal question would be "How to access columns in pandas DataFrameGroupBy.transform?", but after performing some tests (herein showed), I wonder if that's even possible.

I would like to access column names just like I can do with apply, but using transform instead.

For example, given this sample data:

import numpy as np
import pandas as pd

np.random.seed(123)
numeric_data = np.random.rand(9, 3)
cat_data = [f'grp_{i}' for i in range(1,4)] * 3
df = pd.DataFrame(numeric_data, columns=list('ABC')).assign(D = cat_data)
print(df)
          A         B         C      D
0  0.696469  0.286139  0.226851  grp_1
1  0.551315  0.719469  0.423106  grp_2
2  0.980764  0.684830  0.480932  grp_3
3  0.392118  0.343178  0.729050  grp_1
4  0.438572  0.059678  0.398044  grp_2
5  0.737995  0.182492  0.175452  grp_3
6  0.531551  0.531828  0.634401  grp_1
7  0.849432  0.724455  0.611024  grp_2
8  0.722443  0.322959  0.361789  grp_3

How can I use transform to subtract B from A, then multiply by C? Is it possible?
I know that with apply I can easily achieve that by using lambda or passing a user-defined function, like this:

def customFunc(grp):
    return (grp['A'] - grp['B']) * grp['C']

df.groupby('D').apply(customFunc)
D       
grp_1  0    0.093084
       3    0.035679
       6   -0.000175
grp_2  1   -0.071147
       4    0.150817
       7    0.076364
grp_3  2    0.142324
       5    0.097464
       8    0.144529
dtype: float64

The output values, though, it's unsorted (as you can see in the internal index) so that I can't just place this output as it is in a new column. An option would be sorting the dataframe before-hand using apply, but to be honest I'm not fully confident it would work as expected for large data with more complex groups in groupby. I would feel more comfortable using transform, otherwise, I think it's more reliable to merge the result back to the df by index (supposing we have a unique index).

If I try to use the same function with transform:

df.groupby('D').transform(customFunc)

then I get an error: KeyError: 'A'.

To inspect what's going on under the hood when using groupby.apply and groupby.transform, I did the following:

# Select the target-group
grp = df.groupby('D')

grp.apply(lambda x: type(x))

D
grp_1    (<class 'pandas.core.frame.DataFrame'>, 3)
grp_2    (<class 'pandas.core.frame.DataFrame'>, 3)
grp_3    (<class 'pandas.core.frame.DataFrame'>, 3)
dtype: object

grp.transform(lambda x: type(x))
                                     A                                    B                                    C  
0  <property object at 0x7f7dbde619f0>  <property object at 0x7f7dbde619f0>  <property object at 0x7f7dbde619f0>  
1  <property object at 0x7f7dbde619f0>  <property object at 0x7f7dbde619f0>  <property object at 0x7f7dbde619f0>  
2  <property object at 0x7f7dbde619f0>  <property object at 0x7f7dbde619f0>  <property object at 0x7f7dbde619f0>  
3  <property object at 0x7f7dbde619f0>  <property object at 0x7f7dbde619f0>  <property object at 0x7f7dbde619f0>  
4  <property object at 0x7f7dbde619f0>  <property object at 0x7f7dbde619f0>  <property object at 0x7f7dbde619f0>  
5  <property object at 0x7f7dbde619f0>  <property object at 0x7f7dbde619f0>  <property object at 0x7f7dbde619f0>  
6  <property object at 0x7f7dbde619f0>  <property object at 0x7f7dbde619f0>  <property object at 0x7f7dbde619f0>  
7  <property object at 0x7f7dbde619f0>  <property object at 0x7f7dbde619f0>  <property object at 0x7f7dbde619f0>  
8  <property object at 0x7f7dbde619f0>  <property object at 0x7f7dbde619f0>  <property object at 0x7f7dbde619f0> 

As you can see, apply provide us sub-dataframes as groups, while I don't know exactly what transform does provide (it's the first time I've faced a property). I also have done further tests:

# Another trials
grp.transform(lambda x: x.shape) # ValueError
grp.transform(lambda x: x['A']) # KeyError
grp.transform(lambda x: x.loc[0]) # KeyError
grp.transform(lambda x: x.iloc[0]) # works (every value get the first value; similar to 'first')

It seems that with .iloc[] I can access the values for each column independently, but I still couldn't figure how to access the columns within transform (if it's even possible).

So, my final questions:

Upvotes: 3

Views: 1606

Answers (2)

Anakhand
Anakhand

Reputation: 3018

Just to expand on @jezrael's answer:

Is it possible to access column names in groupby.transform to perform calculation across columns?

Not possible, .groupby.transform working with each column separately, so cannot 'see' another columns like groupby.apply.

This is because of how transform is specified (emphases mine):

Notes


Each group is endowed the attribute 'name' in case you need to know which group you are working on.

The current implementation imposes three requirements on f:

  • f must return a value that either has the same shape as the input subframe or can be broadcast to the shape of the input subframe. For example, if f returns a scalar it will be broadcast to have the
    same shape as the input subframe.
  • if this is a DataFrame, f must support application column-by-column in the subframe. If f also supports application to the entire subframe, then a fast path is used starting from the second chunk.
  • f must not mutate groups. Mutation is not supported and may produce unexpected results.

I.e. it requires the function to be able to work on individual columns. But if it works for an entire dataframe, pandas automatically switches to applying on entire groups midway (seems a bit kludgy to me, but whatever). Again, we can see this using print in our transform function:

def f(x):
    print(x.name, type(x))
    return x
In [1]: gb.transform(f)
A <class 'pandas.core.series.Series'>
A <class 'pandas.core.series.Series'>
B <class 'pandas.core.series.Series'>
C <class 'pandas.core.series.Series'>
grp_1 <class 'pandas.core.frame.DataFrame'>
grp_2 <class 'pandas.core.frame.DataFrame'>
grp_3 <class 'pandas.core.frame.DataFrame'>
Out[1]:
          A         B         C
0  0.696469  0.286139  0.226851
1  0.551315  0.719469  0.423106
2  0.980764  0.684830  0.480932
3  0.392118  0.343178  0.729050
4  0.438572  0.059678  0.398044
5  0.737995  0.182492  0.175452
6  0.531551  0.531828  0.634401
7  0.849432  0.724455  0.611024
8  0.722443  0.322959  0.361789

Here we can see what pandas is doing internally:

  • first, a "trial run" on the first column, A (not really sure what this is doing, but it's probably to see if the function is compatible or what method to use for best performance);
  • then it applies f column by column on the first group (grp_1);
  • once it realises that f works on an entire dataframe (I'm guessing by trying f on the first group again and then comparing with the column-by-column result), it switches to applying f to each group everything at once, as whole dataframes.

In fact, knowing this it is actually possible to extract the columns starting from the second part of the algorithm (once it starts applying on entire dataframes):

def f(x):
    try:
        x["A"]
        print(x.name, type(x), "column retrieval succeeded")
    except KeyError:
        print(x.name, type(x), "column retrieval failed")
    return x
In [59]: gb.transform(f)
A <class 'pandas.core.series.Series'> column retrieval failed
A <class 'pandas.core.series.Series'> column retrieval failed
B <class 'pandas.core.series.Series'> column retrieval failed
C <class 'pandas.core.series.Series'> column retrieval failed
grp_1 <class 'pandas.core.frame.DataFrame'> column retrieval succeeded
grp_2 <class 'pandas.core.frame.DataFrame'> column retrieval succeeded
grp_3 <class 'pandas.core.frame.DataFrame'> column retrieval succeeded
Out[59]:
          A         B         C
0  0.696469  0.286139  0.226851
1  0.551315  0.719469  0.423106
2  0.980764  0.684830  0.480932
3  0.392118  0.343178  0.729050
4  0.438572  0.059678  0.398044
5  0.737995  0.182492  0.175452
6  0.531551  0.531828  0.634401
7  0.849432  0.724455  0.611024
8  0.722443  0.322959  0.361789

Of course, this isn't useful in practice, as the transform function, as specified in the documentation, should be able to work on individual columns (and optionally entire dataframes), so you shouldn't be referring to specific columns in this function.

Upvotes: 1

jezrael
jezrael

Reputation: 863256

Is it possible to access column names in groupby.transform to perform calculation across columns?

Not possible, .groupby.transform working with each column separately, so cannot 'see' another columns like groupby.apply.

You can see it if use print:

print (df.groupby('D').transform(lambda x: print(x)))
2    0.980764
5    0.737995
8    0.722443
Name: A, dtype: float64
2    0.684830
5    0.182492
8    0.322959
Name: B, dtype: float64
2    0.480932
5    0.175452
8    0.361789
Name: C, dtype: float64

If not, what's the best (reliable) way to place the output from apply back to the dataframe?

If function not aggregate values:

def customFunc(grp):
    return (grp['A'] - grp['B']) * grp['C']

df['new'] = df.groupby('D').apply(customFunc).rename('new').reset_index(level=0, drop=True)

print (df)
          A         B         C      D       new
0  0.696469  0.286139  0.226851  grp_1  0.093084
1  0.551315  0.719469  0.423106  grp_2 -0.071147
2  0.980764  0.684830  0.480932  grp_3  0.142324
3  0.392118  0.343178  0.729050  grp_1  0.035679
4  0.438572  0.059678  0.398044  grp_2  0.150817
5  0.737995  0.182492  0.175452  grp_3  0.097464
6  0.531551  0.531828  0.634401  grp_1 -0.000175
7  0.849432  0.724455  0.611024  grp_2  0.076364
8  0.722443  0.322959  0.361789  grp_3  0.144529

so working same like:

df['new'] = (df['A'] - df['B']) * df['C']

print (df)
         A         B         C      D       new
0  0.696469  0.286139  0.226851  grp_1  0.093084
1  0.551315  0.719469  0.423106  grp_2 -0.071147
2  0.980764  0.684830  0.480932  grp_3  0.142324
3  0.392118  0.343178  0.729050  grp_1  0.035679
4  0.438572  0.059678  0.398044  grp_2  0.150817
5  0.737995  0.182492  0.175452  grp_3  0.097464
6  0.531551  0.531828  0.634401  grp_1 -0.000175
7  0.849432  0.724455  0.611024  grp_2  0.076364
8  0.722443  0.322959  0.361789  grp_3  0.144529

If function aggregate values use DataFrame.join or Series.map if one column is used for grouping:

def customFunc(grp):
    return ((grp['A'] - grp['B']) * grp['C']).mean()

df = df.join(df.groupby('D').apply(customFunc).rename('new'), on='D')

def customFunc(grp):
    return ((grp['A'] - grp['B']) * grp['C']).mean()

df['new'] = df['D'].map(df.groupby('D').apply(customFunc))


print (df)
       A         B         C      D       new
0  0.696469  0.286139  0.226851  grp_1  0.042863
1  0.551315  0.719469  0.423106  grp_2  0.052011
2  0.980764  0.684830  0.480932  grp_3  0.128106
3  0.392118  0.343178  0.729050  grp_1  0.042863
4  0.438572  0.059678  0.398044  grp_2  0.052011
5  0.737995  0.182492  0.175452  grp_3  0.128106
6  0.531551  0.531828  0.634401  grp_1  0.042863
7  0.849432  0.724455  0.611024  grp_2  0.052011
8  0.722443  0.322959  0.361789  grp_3  0.128106

Or change function:

def customFunc(grp):
    grp['new'] = ((grp['A'] - grp['B']) * grp['C']).mean()
    return grp
    
df = df.groupby('D').apply(customFunc)

print (df)
          A         B         C      D       new
0  0.696469  0.286139  0.226851  grp_1  0.042863
1  0.551315  0.719469  0.423106  grp_2  0.052011
2  0.980764  0.684830  0.480932  grp_3  0.128106
3  0.392118  0.343178  0.729050  grp_1  0.042863
4  0.438572  0.059678  0.398044  grp_2  0.052011
5  0.737995  0.182492  0.175452  grp_3  0.128106
6  0.531551  0.531828  0.634401  grp_1  0.042863
7  0.849432  0.724455  0.611024  grp_2  0.052011
8  0.722443  0.322959  0.361789  grp_3  0.128106

Upvotes: 3

Related Questions