FooBar
FooBar

Reputation: 16488

Pandas group by adds index

Consider the following code, which aims at dividing a column by its grouped mean:

df = pd.DataFrame({'expenditure' : np.random.choice(['foo','bar'], n),
                   'groupid' : np.random.choice(['one','two'], n),
                  'coef' : np.random.randn(n)})
df.set_index('expenditure', inplace=True)
test = df.groupby(level=0).apply(lambda x: x['coef'] / x.coef.mean())

I like this structure of data before applying, before I am usually able to just df['someNewColumn'] = df.apply(...). But strangely, this time, I'm not able to instantly remerge the results.

test should be indexed by the indices that the groupby happened, expenditure. However, it has a double index:

>>> test
expenditure  expenditure
bar          bar           -0.491900
             bar           -9.332964
             bar            8.019472
             bar           -4.540905
             bar            5.627947
             bar           -0.171765
             bar            5.698813
             bar            6.476207
             bar            8.796249
             bar           -8.284087
             bar            1.426311
             bar           -1.223377
foo          foo            1.900897
             foo            7.057078
             foo            0.060856
             foo            3.850323
             foo            2.928085
             foo           -3.249857
             foo            3.176616
             foo           -1.433766
             foo            0.910017
             foo            1.395376
             foo            1.898315
             foo           -1.903462
             foo           -3.590479
Name: coef, dtype: float64

Why does it have a double index and how can I get my normalized column?

>>> test.index
MultiIndex(levels=[[u'bar', u'foo'], [u'bar', u'foo']],
           labels=[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]],
           names=[u'expenditure', u'expenditure'])

My pandas version is 0.15.0.

Upvotes: 1

Views: 122

Answers (3)

Paul H
Paul H

Reputation: 68146

Let's make your dataframe a little simpler for illustration purposes:

import numpy as np
import pandas as pd
n = 10
np.random.seed(0)
df = pd.DataFrame(
    data = {
        'groupid' : np.random.choice(['one','two'], n),
        'coef' : np.arange(n)
    }, 
    index=pd.Index(np.random.choice(['foo','bar'], n), name='expenditure'),
)
df


             coef groupid
expenditure              
bar             0     one
foo             1     two
foo             2     two
bar             3     one
foo             4     two
foo             5     two
foo             6     two
foo             7     two
foo             8     two
bar             9     two

You can compute the mean coef for each expenditure group using two different methods:

means = df['coef'].mean(level='expenditure')

or

means = df['coef'].groupby(level='expenditure').mean()

Both give me:

expenditure
bar            4.000000
foo            4.714286
Name: coef, dtype: float64

So then we can divide the coef column by the grouped means and broadcast that against the expenditure values:

test = df['coef'].div(means, level='expenditure')
test

expenditure
bar            0.000000
bar            0.750000
bar            2.250000
foo            0.212121
foo            0.424242
foo            0.848485
foo            1.060606
foo            1.272727
foo            1.484848
foo            1.696970
Name: coef, dtype: float64

Our original values in the bar group are 0, 3, and 9, so results of 0.0, 0.75, 2.25 check out.

Upvotes: 0

JD Long
JD Long

Reputation: 60746

It's not obvious to me which version of pandas you're using, but your apply does not work for me at all.

I have trouble with grouping on the index. So I always dump the index and group on a normal column:

df = pd.DataFrame({'expenditure' : np.random.choice(['foo','bar'], n),
                   'groupid' : np.random.choice(['one','two'], n),
                  'coef' : np.random.randn(n)})

then you can do either:

df.groupby('expenditure').coef.apply(lambda x: x / x.mean())

or the following which is almost exactly what you tried before

df.groupby('expenditure').apply(lambda x: x.coef / x.coef.mean())

Upvotes: 1

Anzel
Anzel

Reputation: 20553

I'm not entirely sure if that helps you, but since you've indexed the column expenditure, you need to groupby this index before applying to achieve what I believe you want, something like this:

import pandas as pd

n = 10
df = pd.DataFrame({'expenditure' : np.random.choice(['foo','bar'], n),
                   'groupid' : np.random.choice(['one','two'], n),
                  'coef' : np.random.randn(n)})

df.set_index('expenditure', inplace=True)

# when you try to apply, you need to groupby 'expenditure' -- which is the df.index
test = df.groupby(df.index).apply(lambda x: x['coef'] /x.coef.mean())

test

expenditure
bar            expenditure
bar            2.013101
bar       ...
foo            expenditure
foo            1
Name: coef, dtype...
dtype: object

test.index
Index([u'bar', u'foo'], dtype='object')

Upvotes: 0

Related Questions