Reputation: 131
To illustrate my problem, here is an example dataframe:
import numpy as np
import pandas as pd

df = pd.DataFrame({'key1': [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
                   'key2': ['a', 'b', 'b', 'c', 'a', 'a', 'a', 'b', 'b', 'c'],
                   'key3': [10, 5, 15, 10, 5, 10, 20, 10, 20, 5],
                   'zdata': [2, 4, 2, 4, 3, 5, 6, 5, 5, 6]})
which, when grouped and aggregated with
df1 = df.groupby(['key1', 'key2', 'key3'])
df1 = df1.agg({'zdata' : [np.mean]})
produces
               zdata
                mean
key1 key2 key3
0    a    10       2
     b    5        4  * (new group1)
          15       2  * (new group1)
     c    10       4
1    a    5        3  * (new group2)
          10       5  * (new group2)
          20       6  * (new group2)
     b    10       5
          20       5
     c    5        6
I now need to find the mean of zdata and of key3, grouped by key1 and key2, but only for the groups where key3 contains a 5 plus at least one other value (i.e. the rows marked with an * in the table above).
So the desired result would be:
                zdata
                 mean
key1 key2 key3
0    b    10      3
1    a    11.6    4.6
I am certain a lambda function will get there, as in the example "Python Pandas Conditional Sums", but this problem is grouped on multiple columns.
Any help would be much appreciated.
Reputation: 862851
I think it is better to use groupby with mean instead of agg, because the result is then a DataFrame with no MultiIndex in the columns. Then you can use filter with any and len to keep only the wanted groups, and finally groupby with mean again:
# instead of agg, use mean with as_index=False so the keys stay ordinary columns:
df1 = df.groupby(['key1', 'key2', 'key3'], as_index=False)['zdata'].mean()
print(df1)
   key1 key2  key3  zdata
0     0    a    10      2
1     0    b     5      4
2     0    b    15      2
3     0    c    10      4
4     1    a     5      3
5     1    a    10      5
6     1    a    20      6
7     1    b    10      5
8     1    b    20      5
9     1    c     5      6
# keep only the (key1, key2) groups that contain key3 == 5 and have more than one row
df2 = df1.groupby(['key1', 'key2']).filter(lambda x: (x.key3 == 5).any() and (len(x) > 1))
print(df2)
   key1 key2  key3  zdata
1     0    b     5      4
2     0    b    15      2
4     1    a     5      3
5     1    a    10      5
6     1    a    20      6
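The same row selection can also be written as a boolean mask built with transform instead of filter, which can be handy when the mask is needed for something else as well. This is only a sketch of an alternative, not part of the approach above; the names keys, has_five, group_size and mask are made up for illustration, and it assumes key3 has no missing values (count ignores NaN).

```python
import pandas as pd

df = pd.DataFrame({'key1': [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
                   'key2': ['a', 'b', 'b', 'c', 'a', 'a', 'a', 'b', 'b', 'c'],
                   'key3': [10, 5, 15, 10, 5, 10, 20, 10, 20, 5],
                   'zdata': [2, 4, 2, 4, 3, 5, 6, 5, 5, 6]})

# One row per (key1, key2, key3) combination, keys kept as columns.
df1 = df.groupby(['key1', 'key2', 'key3'], as_index=False)['zdata'].mean()

# Group by the two outer keys (illustrative name).
keys = [df1['key1'], df1['key2']]

# True on every row of a group if any row in that group has key3 == 5.
has_five = df1['key3'].eq(5).groupby(keys).transform('any')

# Number of rows in each group, broadcast back to every row.
group_size = df1['key3'].groupby(keys).transform('count')

# Keep rows from groups that contain a 5 and at least one other key3 value.
mask = has_five & (group_size > 1)
df2 = df1[mask]
```

Because mask is aligned with df1, the rows kept here should be the same ones that filter returns.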
print(df2.groupby(['key1', 'key2'], as_index=False).mean())
   key1 key2       key3     zdata
0     0    b  10.000000  3.000000
1     1    a  11.666667  4.666667
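For reuse, the three steps can be collected into one function. The following is a minimal sketch in Python 3 syntax; the function name mean_for_groups_containing and its parameters are hypothetical and only illustrate the approach, with the value 5 passed in as an argument.

```python
import pandas as pd

def mean_for_groups_containing(df, value, outer=('key1', 'key2'),
                               inner='key3', data='zdata'):
    """Hypothetical helper: average `inner` and `data` per `outer` group,
    keeping only groups whose `inner` values include `value` plus at
    least one other value."""
    outer = list(outer)

    # Step 1: mean of the data column per (outer..., inner) combination.
    per_combo = df.groupby(outer + [inner], as_index=False)[data].mean()

    # Step 2: keep outer groups that contain `value` and have more than one row.
    kept = per_combo.groupby(outer).filter(
        lambda g: bool((g[inner] == value).any()) and len(g) > 1
    )

    # Step 3: mean of the inner key and the data column per outer group.
    return kept.groupby(outer, as_index=False)[[inner, data]].mean()

df = pd.DataFrame({'key1': [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
                   'key2': ['a', 'b', 'b', 'c', 'a', 'a', 'a', 'b', 'b', 'c'],
                   'key3': [10, 5, 15, 10, 5, 10, 20, 10, 20, 5],
                   'zdata': [2, 4, 2, 4, 3, 5, 6, 5, 5, 6]})

result = mean_for_groups_containing(df, 5)
print(result)
```

Note that step 2 counts rows after step 1 has collapsed duplicate (key1, key2, key3) combinations, so "more than one row" means more than one distinct key3 value in the group.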