Reputation: 16488
Consider the following code, which aims at dividing a column by its grouped mean:
df = pd.DataFrame({'expenditure' : np.random.choice(['foo','bar'], n),
'groupid' : np.random.choice(['one','two'], n),
'coef' : np.random.randn(n)})
df.set_index('expenditure', inplace=True)
test = df.groupby(level=0).apply(lambda x: x['coef'] / x.coef.mean())
I like this structure of data before apply
ing, before I am usually able to just df['someNewColumn'] = df.apply(...)
. But strangely, this time, I'm not able to instantly remerge the results.
test
should be indexed by the indices that the groupby
happened, expenditure
. However, it has a double index:
>>> test
expenditure expenditure
bar bar -0.491900
bar -9.332964
bar 8.019472
bar -4.540905
bar 5.627947
bar -0.171765
bar 5.698813
bar 6.476207
bar 8.796249
bar -8.284087
bar 1.426311
bar -1.223377
foo foo 1.900897
foo 7.057078
foo 0.060856
foo 3.850323
foo 2.928085
foo -3.249857
foo 3.176616
foo -1.433766
foo 0.910017
foo 1.395376
foo 1.898315
foo -1.903462
foo -3.590479
Name: coef, dtype: float64
Why does it have a double index and how can I get my normalized column?
>>> test.index
MultiIndex(levels=[[u'bar', u'foo'], [u'bar', u'foo']],
labels=[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]],
names=[u'expenditure', u'expenditure'])
My pandas
version is 0.15.0
.
Upvotes: 1
Views: 122
Reputation: 68146
Let's make your dataframe a little simpler for illustration purposes:
import numpy as np
import pandas as pd
n = 10
np.random.seed(0)
df = pd.DataFrame(
data = {
'groupid' : np.random.choice(['one','two'], n),
'coef' : np.arange(n)
},
index=pd.Index(np.random.choice(['foo','bar'], n), name='expenditure'),
)
df
coef groupid
expenditure
bar 0 one
foo 1 two
foo 2 two
bar 3 one
foo 4 two
foo 5 two
foo 6 two
foo 7 two
foo 8 two
bar 9 two
You can compute the mean coef
for each expenditure
group using two different methods:
means = df['coef'].mean(level='expenditure')
or
means = df['coef'].groupby(level='expenditure').mean()
Both give me:
expenditure
bar 4.000000
foo 4.714286
Name: coef, dtype: float64
So then we can divide the coef
column by the grouped means and broadcast that against the expenditure
values:
test = df['coef'].div(means, level='expenditure')
test
expenditure
bar 0.000000
bar 0.750000
bar 2.250000
foo 0.212121
foo 0.424242
foo 0.848485
foo 1.060606
foo 1.272727
foo 1.484848
foo 1.696970
Name: coef, dtype: float64
Our original values in the bar
group are 0, 3, and 9, so results of 0.0, 0.75, 2.25 check out.
Upvotes: 0
Reputation: 60746
It's not obvious to me which version of pandas you're using, but your apply does not work for me at all.
I have trouble with grouping on the index. So I always dump the index and group on a normal column:
df = pd.DataFrame({'expenditure' : np.random.choice(['foo','bar'], n),
'groupid' : np.random.choice(['one','two'], n),
'coef' : np.random.randn(n)})
then you can do either:
df.groupby('expenditure').coef.apply(lambda x: x / x.mean())
or the following which is almost exactly what you tried before
df.groupby('expenditure').apply(lambda x: x.coef / x.coef.mean())
Upvotes: 1
Reputation: 20553
I'm not entirely sure if that helps you, but since you've indexed the column expenditure
, you need to groupby this index before applying to achieve what I believe you want, something like this:
import pandas as pd
n = 10
df = pd.DataFrame({'expenditure' : np.random.choice(['foo','bar'], n),
'groupid' : np.random.choice(['one','two'], n),
'coef' : np.random.randn(n)})
df.set_index('expenditure', inplace=True)
# when you try to apply, you need to groupby 'expenditure' -- which is the df.index
test = df.groupby(df.index).apply(lambda x: x['coef'] /x.coef.mean())
test
expenditure
bar expenditure
bar 2.013101
bar ...
foo expenditure
foo 1
Name: coef, dtype...
dtype: object
test.index
Index([u'bar', u'foo'], dtype='object')
Upvotes: 0