Reputation: 4929
My ideal question would be "How to access columns in pandas DataFrameGroupBy.transform?", but after performing some tests (herein showed), I wonder if that's even possible.
I would like to access column names just like I can do with apply
, but using transform
instead.
For example, given this sample data:
import numpy as np
import pandas as pd
np.random.seed(123)
numeric_data = np.random.rand(9, 3)
cat_data = [f'grp_{i}' for i in range(1,4)] * 3
df = pd.DataFrame(numeric_data, columns=list('ABC')).assign(D = cat_data)
print(df)
A B C D
0 0.696469 0.286139 0.226851 grp_1
1 0.551315 0.719469 0.423106 grp_2
2 0.980764 0.684830 0.480932 grp_3
3 0.392118 0.343178 0.729050 grp_1
4 0.438572 0.059678 0.398044 grp_2
5 0.737995 0.182492 0.175452 grp_3
6 0.531551 0.531828 0.634401 grp_1
7 0.849432 0.724455 0.611024 grp_2
8 0.722443 0.322959 0.361789 grp_3
How can I use transform
to subtract B from A, then multiply by C? Is it possible?
I know that with apply I can easily achieve that by using lambda or passing a user-defined function, like this:
def customFunc(grp):
return (grp['A'] - grp['B']) * grp['C']
df.groupby('D').apply(customFunc)
D
grp_1 0 0.093084
3 0.035679
6 -0.000175
grp_2 1 -0.071147
4 0.150817
7 0.076364
grp_3 2 0.142324
5 0.097464
8 0.144529
dtype: float64
The output values, though, it's unsorted (as you can see in the internal index) so that I can't just place this output as it is in a new column. An option would be sorting the dataframe before-hand using apply, but to be honest I'm not fully confident it would work as expected for large data with more complex groups in groupby. I would feel more comfortable using transform, otherwise, I think it's more reliable to merge the result back to the df by index (supposing we have a unique index).
If I try to use the same function with transform:
df.groupby('D').transform(customFunc)
then I get an error: KeyError: 'A'
.
To inspect what's going on under the hood when using groupby.apply
and groupby.transform
, I did the following:
# Select the target-group
grp = df.groupby('D')
grp.apply(lambda x: type(x))
D
grp_1 (<class 'pandas.core.frame.DataFrame'>, 3)
grp_2 (<class 'pandas.core.frame.DataFrame'>, 3)
grp_3 (<class 'pandas.core.frame.DataFrame'>, 3)
dtype: object
grp.transform(lambda x: type(x))
A B C
0 <property object at 0x7f7dbde619f0> <property object at 0x7f7dbde619f0> <property object at 0x7f7dbde619f0>
1 <property object at 0x7f7dbde619f0> <property object at 0x7f7dbde619f0> <property object at 0x7f7dbde619f0>
2 <property object at 0x7f7dbde619f0> <property object at 0x7f7dbde619f0> <property object at 0x7f7dbde619f0>
3 <property object at 0x7f7dbde619f0> <property object at 0x7f7dbde619f0> <property object at 0x7f7dbde619f0>
4 <property object at 0x7f7dbde619f0> <property object at 0x7f7dbde619f0> <property object at 0x7f7dbde619f0>
5 <property object at 0x7f7dbde619f0> <property object at 0x7f7dbde619f0> <property object at 0x7f7dbde619f0>
6 <property object at 0x7f7dbde619f0> <property object at 0x7f7dbde619f0> <property object at 0x7f7dbde619f0>
7 <property object at 0x7f7dbde619f0> <property object at 0x7f7dbde619f0> <property object at 0x7f7dbde619f0>
8 <property object at 0x7f7dbde619f0> <property object at 0x7f7dbde619f0> <property object at 0x7f7dbde619f0>
As you can see, apply
provide us sub-dataframes as groups, while I don't know exactly what transform
does provide (it's the first time I've faced a property). I also have done further tests:
# Another trials
grp.transform(lambda x: x.shape) # ValueError
grp.transform(lambda x: x['A']) # KeyError
grp.transform(lambda x: x.loc[0]) # KeyError
grp.transform(lambda x: x.iloc[0]) # works (every value get the first value; similar to 'first')
It seems that with .iloc[]
I can access the values for each column independently, but I still couldn't figure how to access the columns within transform
(if it's even possible).
So, my final questions:
groupby.transform
to perform calculation across columns?apply
back to the dataframe?Upvotes: 3
Views: 1606
Reputation: 3018
Just to expand on @jezrael's answer:
Is it possible to access column names in
groupby.transform
to perform calculation across columns?Not possible,
.groupby.transform
working with each column separately, so cannot 'see' another columns likegroupby.apply
.
This is because of how transform
is specified (emphases mine):
Notes
Each group is endowed the attribute 'name' in case you need to know which group you are working on.
The current implementation imposes three requirements on f:
- f must return a value that either has the same shape as the input subframe or can be broadcast to the shape of the input subframe. For example, if
f
returns a scalar it will be broadcast to have the
same shape as the input subframe.- if this is a DataFrame, f must support application column-by-column in the subframe. If f also supports application to the entire subframe, then a fast path is used starting from the second chunk.
- f must not mutate groups. Mutation is not supported and may produce unexpected results.
I.e. it requires the function to be able to work on individual columns. But if it works for an entire dataframe, pandas
automatically switches to applying on entire groups midway (seems a bit kludgy to me, but whatever). Again, we can see this using print
in our transform function:
def f(x):
print(x.name, type(x))
return x
In [1]: gb.transform(f)
A <class 'pandas.core.series.Series'>
A <class 'pandas.core.series.Series'>
B <class 'pandas.core.series.Series'>
C <class 'pandas.core.series.Series'>
grp_1 <class 'pandas.core.frame.DataFrame'>
grp_2 <class 'pandas.core.frame.DataFrame'>
grp_3 <class 'pandas.core.frame.DataFrame'>
Out[1]:
A B C
0 0.696469 0.286139 0.226851
1 0.551315 0.719469 0.423106
2 0.980764 0.684830 0.480932
3 0.392118 0.343178 0.729050
4 0.438572 0.059678 0.398044
5 0.737995 0.182492 0.175452
6 0.531551 0.531828 0.634401
7 0.849432 0.724455 0.611024
8 0.722443 0.322959 0.361789
Here we can see what pandas
is doing internally:
A
(not really sure what this is doing, but it's probably to see if the function is compatible or what method to use for best performance);f
column by column on the first group (grp_1
);f
works on an entire dataframe (I'm guessing by trying f
on the first group again and then comparing with the column-by-column result), it switches to applying f
to each group everything at once, as whole dataframes.In fact, knowing this it is actually possible to extract the columns starting from the second part of the algorithm (once it starts applying on entire dataframes):
def f(x):
try:
x["A"]
print(x.name, type(x), "column retrieval succeeded")
except KeyError:
print(x.name, type(x), "column retrieval failed")
return x
In [59]: gb.transform(f)
A <class 'pandas.core.series.Series'> column retrieval failed
A <class 'pandas.core.series.Series'> column retrieval failed
B <class 'pandas.core.series.Series'> column retrieval failed
C <class 'pandas.core.series.Series'> column retrieval failed
grp_1 <class 'pandas.core.frame.DataFrame'> column retrieval succeeded
grp_2 <class 'pandas.core.frame.DataFrame'> column retrieval succeeded
grp_3 <class 'pandas.core.frame.DataFrame'> column retrieval succeeded
Out[59]:
A B C
0 0.696469 0.286139 0.226851
1 0.551315 0.719469 0.423106
2 0.980764 0.684830 0.480932
3 0.392118 0.343178 0.729050
4 0.438572 0.059678 0.398044
5 0.737995 0.182492 0.175452
6 0.531551 0.531828 0.634401
7 0.849432 0.724455 0.611024
8 0.722443 0.322959 0.361789
Of course, this isn't useful in practice, as the transform function, as specified in the documentation, should be able to work on individual columns (and optionally entire dataframes), so you shouldn't be referring to specific columns in this function.
Upvotes: 1
Reputation: 863256
Is it possible to access column names in groupby.transform to perform calculation across columns?
Not possible, .groupby.transform
working with each column separately, so cannot 'see' another columns like groupby.apply
.
You can see it if use print
:
print (df.groupby('D').transform(lambda x: print(x)))
2 0.980764
5 0.737995
8 0.722443
Name: A, dtype: float64
2 0.684830
5 0.182492
8 0.322959
Name: B, dtype: float64
2 0.480932
5 0.175452
8 0.361789
Name: C, dtype: float64
If not, what's the best (reliable) way to place the output from apply back to the dataframe?
If function not aggregate values:
def customFunc(grp):
return (grp['A'] - grp['B']) * grp['C']
df['new'] = df.groupby('D').apply(customFunc).rename('new').reset_index(level=0, drop=True)
print (df)
A B C D new
0 0.696469 0.286139 0.226851 grp_1 0.093084
1 0.551315 0.719469 0.423106 grp_2 -0.071147
2 0.980764 0.684830 0.480932 grp_3 0.142324
3 0.392118 0.343178 0.729050 grp_1 0.035679
4 0.438572 0.059678 0.398044 grp_2 0.150817
5 0.737995 0.182492 0.175452 grp_3 0.097464
6 0.531551 0.531828 0.634401 grp_1 -0.000175
7 0.849432 0.724455 0.611024 grp_2 0.076364
8 0.722443 0.322959 0.361789 grp_3 0.144529
so working same like:
df['new'] = (df['A'] - df['B']) * df['C']
print (df)
A B C D new
0 0.696469 0.286139 0.226851 grp_1 0.093084
1 0.551315 0.719469 0.423106 grp_2 -0.071147
2 0.980764 0.684830 0.480932 grp_3 0.142324
3 0.392118 0.343178 0.729050 grp_1 0.035679
4 0.438572 0.059678 0.398044 grp_2 0.150817
5 0.737995 0.182492 0.175452 grp_3 0.097464
6 0.531551 0.531828 0.634401 grp_1 -0.000175
7 0.849432 0.724455 0.611024 grp_2 0.076364
8 0.722443 0.322959 0.361789 grp_3 0.144529
If function aggregate values use DataFrame.join
or Series.map
if one column is used for grouping:
def customFunc(grp):
return ((grp['A'] - grp['B']) * grp['C']).mean()
df = df.join(df.groupby('D').apply(customFunc).rename('new'), on='D')
def customFunc(grp):
return ((grp['A'] - grp['B']) * grp['C']).mean()
df['new'] = df['D'].map(df.groupby('D').apply(customFunc))
print (df)
A B C D new
0 0.696469 0.286139 0.226851 grp_1 0.042863
1 0.551315 0.719469 0.423106 grp_2 0.052011
2 0.980764 0.684830 0.480932 grp_3 0.128106
3 0.392118 0.343178 0.729050 grp_1 0.042863
4 0.438572 0.059678 0.398044 grp_2 0.052011
5 0.737995 0.182492 0.175452 grp_3 0.128106
6 0.531551 0.531828 0.634401 grp_1 0.042863
7 0.849432 0.724455 0.611024 grp_2 0.052011
8 0.722443 0.322959 0.361789 grp_3 0.128106
Or change function:
def customFunc(grp):
grp['new'] = ((grp['A'] - grp['B']) * grp['C']).mean()
return grp
df = df.groupby('D').apply(customFunc)
print (df)
A B C D new
0 0.696469 0.286139 0.226851 grp_1 0.042863
1 0.551315 0.719469 0.423106 grp_2 0.052011
2 0.980764 0.684830 0.480932 grp_3 0.128106
3 0.392118 0.343178 0.729050 grp_1 0.042863
4 0.438572 0.059678 0.398044 grp_2 0.052011
5 0.737995 0.182492 0.175452 grp_3 0.128106
6 0.531551 0.531828 0.634401 grp_1 0.042863
7 0.849432 0.724455 0.611024 grp_2 0.052011
8 0.722443 0.322959 0.361789 grp_3 0.128106
Upvotes: 3