Reputation: 1083
I am following the "flexible apply" example from the pandas user guide: https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html#flexible-apply
Data:
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
        "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
        "C": np.random.randn(8),
        "D": np.random.randn(8),
    }
)
Group by 'A', select only column 'C', then call apply:
grouped = df.groupby('A')['C']

def f(group):
    return pd.DataFrame({'original': group,
                         'demeaned': group - group.mean()})

grouped.apply(f)
This works, but when I group by 'A' and select both columns 'C' and 'D', the same apply call fails:
grouped = df.groupby('A')[['C', 'D']]

for name, val in grouped:
    print(name)
    print(val)

grouped.apply(f)
What am I doing wrong here?
Thank you, Phan
Upvotes: 0
Views: 354
Reputation: 142651
When you select a single column (['C']), each group is a pandas.Series, but when you select several columns ([['C', 'D']]), each group is a pandas.DataFrame, and that needs different code in f().
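As a quick way to see the difference, here is a minimal sketch that iterates over both kinds of selection and records the type of each group. The small frame and the helper name group_types are made up for illustration and are not part of the question.

```python
import numpy as np
import pandas as pd

# Small illustrative frame (not the data from the question).
df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar"],
    "C": np.arange(4.0),
    "D": np.arange(4.0) * 10,
})

def group_types(grouped):
    """Hypothetical helper: collect the type names of the groups."""
    return {type(group).__name__ for _, group in grouped}

# ['C'] selects one column, so every group is a Series.
series_types = group_types(df.groupby("A")["C"])

# [['C', 'D']] selects a list of columns, so every group is a DataFrame.
frame_types = group_types(df.groupby("A")[["C", "D"]])

print(series_types)
print(frame_types)
```

The function passed to apply() receives exactly these objects, which is why an f() written for a Series does not carry over to the two-column case unchanged.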
For two columns, f() could be:
grouped = df.groupby('A')[['C', 'D']]

def f(group):
    return pd.DataFrame({
        'original_C': group['C'],
        'original_D': group['D'],
        'demeaned_C': group['C'] - group['C'].mean(),
        'demeaned_D': group['D'] - group['D'].mean(),
    })

grouped.apply(f)
Result:
   original_C  original_D  demeaned_C  demeaned_D
0   -0.122789    0.216775   -0.611724    1.085802
1   -0.500153    0.912777   -0.293509    0.210248
2    0.875879   -1.582470    0.386944   -0.713443
3   -0.250717    1.770375   -0.044073    1.067846
4    1.261891    0.177318    0.772956    1.046345
5    0.130939   -0.575565    0.337582   -1.278094
6   -1.121481   -0.964481   -1.610417   -0.095454
7    1.551176   -2.192277    1.062241   -1.323250
Because with two columns each group is already a DataFrame, you can also write it shorter, without building a new pd.DataFrame():
def f(group):
    # Work on a copy so the object passed in by apply() is not mutated.
    group = group.copy()
    group[['demeaned_C', 'demeaned_D']] = group - group.mean()
    return group
or, more generally, for any set of columns:
def f(group):
    # Work on a copy so the object passed in by apply() is not mutated.
    group = group.copy()
    for col in group.columns:
        group[f'demeaned_{col}'] = group[col] - group[col].mean()
    return group
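As an aside, the same demeaning can probably be done without apply() at all. The sketch below uses a different technique from the answer: groupby transform("mean") to broadcast each group's mean back to the original rows, followed by a plain subtraction. The small frame is illustrative only.

```python
import pandas as pd

# Small illustrative frame (not the data from the question).
df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar"],
    "C": [1.0, 2.0, 3.0, 4.0],
    "D": [10.0, 20.0, 30.0, 40.0],
})

grouped = df.groupby("A")[["C", "D"]]

# transform("mean") returns a frame shaped like df[["C", "D"]] in which
# every row holds the mean of the group that row belongs to.
demeaned = df[["C", "D"]] - grouped.transform("mean")

# Put the original and the demeaned columns side by side.
result = df[["C", "D"]].join(demeaned.add_prefix("demeaned_"))
print(result)
```

This keeps the original row order and index, and it avoids writing a separate f() for the Series and DataFrame cases.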
BTW: if you use [['C']] instead of ['C'], then you also get a DataFrame instead of a Series, and you can use the last version of f().
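A minimal sketch of that last point, assuming the column-loop version of f() (repeated here so the snippet runs on its own) and a small made-up frame. group_keys=False is an added assumption: newer pandas versions may otherwise prepend the group key to the result index.

```python
import numpy as np
import pandas as pd

# Small illustrative frame (not the data from the question).
df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
    "C": np.arange(8.0),
    "D": np.arange(8.0) ** 2,
})

def f(group):
    # Work on a copy so the object passed in by apply() is not mutated.
    group = group.copy()
    for col in group.columns:
        group[f"demeaned_{col}"] = group[col] - group[col].mean()
    return group

# [['C']] is a one-column DataFrame selection, so the same f() works
# for one column and for several columns.
only_c = df.groupby("A", group_keys=False)[["C"]].apply(f).sort_index()
c_and_d = df.groupby("A", group_keys=False)[["C", "D"]].apply(f).sort_index()

print(list(only_c.columns))
print(list(c_and_d.columns))
```

With ['C'] instead of [['C']], f() would receive a Series, which has no .columns attribute, so the loop would fail.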
Upvotes: 2