weymouth
weymouth

Reputation: 541

aggregate one level of grouped data in pandas

I've got a set of data in pandas which has been grouped by two factors so I could sum over the groups separately. In other words:

grouped = df.groupby(['A','B'])['C'].sum()

Now, I would like to take the std of this sum "across" B so I can see how this deviation changes for the different A values. How can I do this aggregate operation across one "dimension" or "index" of the grouped data?

I'm new to pandas, so this may be easy... but thanks for the help!

Upvotes: 1

Views: 4687

Answers (1)

jezrael
jezrael

Reputation: 863216

It seems you need groupby with parameter level:

grouped = df.groupby(['A','B'])['C'].sum().groupby(level='B').std()

Sample:

np.random.seed(100)
df = pd.DataFrame(np.random.randint(5, size=(10,3)), columns=list('ABC'))
print (df)
   A  B  C
0  0  0  3
1  0  2  4
2  2  2  2
3  2  1  0
4  0  4  3
5  4  2  0
6  3  1  2
7  3  4  4
8  1  3  4
9  4  3  3

grouped = df.groupby(['A','B'])['C'].sum().groupby(level='B').std().reset_index()
print (grouped)
   B         C
0  0       NaN
1  1  1.414214
2  2  2.000000
3  3  0.707107
4  4  0.707107

grouped = df.groupby(['A','B'])['C'].sum().groupby(level=1).std().reset_index()
print (grouped)
   B         C
0  0       NaN
1  1  1.414214
2  2  2.000000
3  3  0.707107
4  4  0.707107

Explanation, per partes:

#groupby by columns A, B, aggregate column C 
#->output is Series with MultiIndex
grouped1 = df.groupby(['A','B'])['C'].sum()
print (grouped1)
A  B
0  0    3
   2    4
   4    3
1  3    4
2  1    0
   2    2
3  1    2
   4    4
4  2    0
   3    3
Name: C, dtype: int32

print (type(grouped1))
<class 'pandas.core.series.Series'>

print (grouped1.index)
MultiIndex(levels=[[0, 1, 2, 3, 4], [0, 1, 2, 3, 4]],
           labels=[[0, 0, 0, 1, 2, 2, 3, 3, 4, 4], [0, 2, 4, 3, 1, 2, 1, 4, 2, 3]],
           names=['A', 'B'])

#groupby by level B of MultiIndex 
#->output is Series with MultiIndex, so reset_index for df
grouped = grouped1.groupby(level='B').std().reset_index()
print (grouped)
   B         C
0  0       NaN
1  1  1.414214
2  2  2.000000
3  3  0.707107
4  4  0.707107
#all together
grouped = df.groupby(['A','B'])['C'].sum().groupby(level='B').std().reset_index()
print (grouped)
   B         C
0  0       NaN
1  1  1.414214
2  2  2.000000
3  3  0.707107
4  4  0.707107

Upvotes: 1

Related Questions