mohitos

Reputation: 131

python pandas conditional sum on dataframe which is grouped on multiple columns

To illustrate my problem, here is an example DataFrame:

import numpy as np
import pandas as pd

df = pd.DataFrame({'key1': [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
                   'key2': ['a', 'b', 'b', 'c', 'a', 'a', 'a', 'b', 'b', 'c'],
                   'key3': [10, 5, 15, 10, 5, 10, 20, 10, 20, 5],
                   'zdata': [2, 4, 2, 4, 3, 5, 6, 5, 5, 6]})

which, when grouped and aggregated with

df1 = df.groupby(['key1', 'key2', 'key3'])
df1 = df1.agg({'zdata' : [np.mean]})

produces

                zdata
                mean
key1 key2 key3      
0    a    10       2
     b    5        4 * (new group1)
          15       2 * (new group1)
     c    10       4
1    a    5        3 * (new group2)
          10       5 * (new group2)
          20       6 * (new group2)
     b    10       5
          20       5
     c    5        6

I now need the mean of zdata and key3, grouped by key1 and key2 only, and restricted to the groups in which key3 contains a 5 plus at least one other value (i.e. the rows marked with * in the table above).

So the desired result would be

                zdata
                mean
key1 key2 key3      
0    b    10       3
1    a    11.6     4.6

I am fairly sure a lambda function will get there, as in the example "Python Pandas Conditional Sums", but this problem is grouped on multiple columns.

Any help would be much appreciated.

Upvotes: 3

Views: 2229

Answers (1)

jezrael

Reputation: 862851

I think it is better to use groupby with mean instead of agg, because the result is then a DataFrame with no MultiIndex in the columns. After that you can use filter with any and len, and finally groupby with mean again:

# instead of agg, use:
df1 = df.groupby(['key1', 'key2', 'key3'], as_index=False)['zdata'].mean()
print(df1)
   key1 key2  key3  zdata
0     0    a    10      2
1     0    b     5      4
2     0    b    15      2
3     0    c    10      4
4     1    a     5      3
5     1    a    10      5
6     1    a    20      6
7     1    b    10      5
8     1    b    20      5
9     1    c     5      6

# keep only (key1, key2) groups that contain key3 == 5 and have more than one row
df2 = df1.groupby(['key1', 'key2']).filter(lambda x: (x.key3 == 5).any() and len(x) > 1)
print(df2)
   key1 key2  key3  zdata
1     0    b     5      4
2     0    b    15      2
4     1    a     5      3
5     1    a    10      5
6     1    a    20      6

print(df2.groupby(['key1', 'key2'], as_index=False).mean())
   key1 key2       key3     zdata
0     0    b  10.000000  3.000000
1     1    a  11.666667  4.666667
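
For convenience, here are the three steps above put together as one self-contained Python 3 sketch. The names per_key3, kept, result and the helper has_five_and_more are only illustrative (they do not appear in the question), and I am assuming a reasonably recent pandas version. Selecting ['key3', 'zdata'] explicitly in the last step is my own choice to make clear which columns are averaged.

```python
import pandas as pd

df = pd.DataFrame({
    'key1': [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
    'key2': ['a', 'b', 'b', 'c', 'a', 'a', 'a', 'b', 'b', 'c'],
    'key3': [10, 5, 15, 10, 5, 10, 20, 10, 20, 5],
    'zdata': [2, 4, 2, 4, 3, 5, 6, 5, 5, 6],
})

# Step 1: mean of zdata per (key1, key2, key3), keeping the keys as columns.
per_key3 = df.groupby(['key1', 'key2', 'key3'], as_index=False)['zdata'].mean()


def has_five_and_more(group):
    """Hypothetical helper: True if the group has a key3 == 5 row and more than one row."""
    return bool((group['key3'] == 5).any() and len(group) > 1)


# Step 2: keep only the (key1, key2) groups that satisfy the condition.
kept = per_key3.groupby(['key1', 'key2']).filter(has_five_and_more)

# Step 3: average key3 and zdata over the remaining rows of each (key1, key2) group.
result = kept.groupby(['key1', 'key2'], as_index=False)[['key3', 'zdata']].mean()
print(result)
```

Note that step 3 averages the per-key3 means from step 1. In this sample every (key1, key2, key3) combination occurs only once, so that is the same as averaging the raw rows; with repeated combinations the two could differ.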

Upvotes: 3
