PTQuoc

Reputation: 1083

Groupby selecting certain columns

I am following the example here: https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html#flexible-apply

Data:

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
        "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
        "C": np.random.randn(8),
        "D": np.random.randn(8),
    }
)

Group by 'A', select only column 'C', then call apply:

grouped = df.groupby('A')['C']

def f(group):
    return pd.DataFrame({'original': group,
                         'demeaned': group - group.mean()})

grouped.apply(f)

This works fine. But when I group by 'A' and select both columns 'C' and 'D', the same apply call fails:

grouped = df.groupby('A')[['C', 'D']]

for name, val in grouped:
    print(name)
    print(val)

grouped.apply(f)

What am I doing wrong here?

Thank you, Phan

Upvotes: 0

Views: 354

Answers (1)

furas

Reputation: 142651

When you select a single column (['C']), each group is a pandas.Series, but when you select several columns ([['C', 'D']]), each group is a pandas.DataFrame, and this needs different code in f().
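To see the difference directly, here is a minimal sketch that records the type of each group for both selections. The small frame and the helper name group_types are made up for illustration and are not part of the question.

```python
import numpy as np
import pandas as pd

# Small illustrative frame (not the data from the question).
df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar"],
    "C": np.arange(4.0),
    "D": np.arange(4.0) * 10,
})

def group_types(grouped):
    # Hypothetical helper: map each group name to the type name of its chunk.
    return {name: type(chunk).__name__ for name, chunk in grouped}

single = group_types(df.groupby("A")["C"])        # single column selected
multi = group_types(df.groupby("A")[["C", "D"]])  # list of columns selected

print(single)
print(multi)
```

The first dictionary should report Series for every group and the second DataFrame, which is why the same f() cannot serve both cases unchanged.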

For two columns it could be

grouped = df.groupby('A')[['C', 'D']]

def f(group):
    return pd.DataFrame({
                'original_C': group['C'],
                'original_D': group['D'],
                'demeaned_C': group['C'] - group['C'].mean(),
                'demeaned_D': group['D'] - group['D'].mean(),
           })

grouped.apply(f)

Result:

   original_C  original_D  demeaned_C  demeaned_D
0   -0.122789    0.216775   -0.611724    1.085802
1   -0.500153    0.912777   -0.293509    0.210248
2    0.875879   -1.582470    0.386944   -0.713443
3   -0.250717    1.770375   -0.044073    1.067846
4    1.261891    0.177318    0.772956    1.046345
5    0.130939   -0.575565    0.337582   -1.278094
6   -1.121481   -0.964481   -1.610417   -0.095454
7    1.551176   -2.192277    1.062241   -1.323250

Because with two columns each group is already a DataFrame, you can also write it shorter, without building a new pd.DataFrame()

def f(group):
    # work on a copy so the group passed in by apply() is not modified in place
    group = group.copy()
    group[['demeaned_C', 'demeaned_D']] = group - group.mean()

    return group

or, more generally, for any set of selected columns

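def f(group):
    # work on a copy so the group passed in by apply() is not modified in place
    group = group.copy()
    for col in list(group.columns):
        group[f'demeaned_{col}'] = group[col] - group[col].mean()

    return group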

BTW:

If you use [['C']] instead of ['C'], you also get a DataFrame instead of a Series, and you can use the last version of f().
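A minimal sketch of that single-column case, using a small made-up frame rather than the random data from the question. The copy() call, group_keys=False and sort_index() are my additions to keep the output in the original row order across pandas versions; they are assumptions, not something the approach requires.

```python
import pandas as pd

# Illustrative data with easy-to-check group means (foo: 2.0, bar: 4.0).
df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar"],
    "C": [1.0, 2.0, 3.0, 6.0],
})

def f(group):
    # Copy first so the chunk handed over by apply() is left untouched.
    group = group.copy()
    for col in list(group.columns):
        group[f"demeaned_{col}"] = group[col] - group[col].mean()
    return group

# [["C"]] (a list) keeps each group a DataFrame, so the column loop works.
result = df.groupby("A", group_keys=False)[["C"]].apply(f).sort_index()
print(result)
```

With ['C'] instead of [['C']], group would be a Series, which has no .columns attribute, so this version of f() would fail.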

Upvotes: 2
