Mike

Reputation: 858

Filter groups on whether the max of one column, subject to a condition on another, is equal to a specific value

I want to slice (or filter for) the groups in d2.groupby('g') (though I'd settle for the rows) where the max of the values of s in the group, taken over only the rows where test is 1, equals 2.

pd.__version__
# '0.24.1'

d2 = pd.DataFrame({'g': [1, 1, 1, 2, 2, 2],
                   'test': [1, 1, 0, 1, 0, 0],
                   's': [1, 2, 3, 1, 2, 3]})

d2
   g  test  s
0  1     1  1
1  1     1  2
2  1     0  3
3  2     1  1
4  2     0  2
5  2     0  3

In plain English: I want to keep groups where the maximum value of s that has a corresponding test value of 1 is 2. For the example above, I want to keep group g == 1 because its second row has test == 1 and s == 2, while its third row (with s == 3) has test == 0, so that row doesn't count toward the max. Expected output for this data:

   g  test  s
0  1     1  1
1  1     1  2
2  1     0  3

I've tried d2.groupby('g').filter(lambda x: (x.test == 1)) followed by nonsense, as well as d2.groupby('g')[[d2.s == 1]] with more nonsense. The latter gives me the message that Series objects are mutable and cannot be hashed. I've tried a number of other equally senseless and useless approaches. How do I groupby('g') and then filter on the max of s where test is 1? This seems like it should be easy, but I can't get it. I could get there by adding a column, but that seems like a kludge.
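For reference, the add-a-column kludge I mean would be something like this (the s_masked name is just a placeholder I made up):

```python
import pandas as pd

d2 = pd.DataFrame({'g': [1, 1, 1, 2, 2, 2],
                   'test': [1, 1, 0, 1, 0, 0],
                   's': [1, 2, 3, 1, 2, 3]})

# Helper column: s where test == 1, NaN elsewhere
d2['s_masked'] = d2['s'].where(d2['test'] == 1)

# Keep groups whose masked max is 2, then drop the helper column
out = (d2[d2.groupby('g')['s_masked'].transform('max') == 2]
       .drop(columns='s_masked'))
```

It gives the output I want, but I'd rather not mutate the frame just to filter it.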

Note: I have searched for the answer for this. I'd gladly take any search hints about how to find any current solutions for this question, if there are any. Thanks!

Upvotes: 3

Views: 57

Answers (2)

cs95

Reputation: 402663

You can use groupby and transform to compute a mask, something like this:

df[df['s'].where(df['test'] == 1).groupby(df['g']).transform('max') == 2]

   g  test  s
0  1     1  1
1  1     1  2
2  1     0  3

The idea behind Series.where is to let us only consider values of "s" where "test" is 1.
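To see what the intermediates look like (shown here purely for illustration, using the same df):

```python
import pandas as pd

df = pd.DataFrame({'g': [1, 1, 1, 2, 2, 2],
                   'test': [1, 1, 0, 1, 0, 0],
                   's': [1, 2, 3, 1, 2, 3]})

# s with rows where test != 1 replaced by NaN (NaN is ignored by max)
masked = df['s'].where(df['test'] == 1)

# Per-group max of the masked values, broadcast back to every row
group_max = masked.groupby(df['g']).transform('max')
```

group_max is 2.0 for every row of group 1 and 1.0 for every row of group 2, so comparing it to 2 keeps exactly the rows of group 1.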

Here's a similar version of the above, inspired by WeNYoBen's answer, which works here because "test" is only ever 0 or 1 (multiplying by "test" zeroes out the rows you want to ignore).

df[df['s'].mul(df['test']).groupby(df['g']).transform('max').eq(2)]

   g  test  s
0  1     1  1
1  1     1  2
2  1     0  3

Upvotes: 2

BENY

Reputation: 323306

Using filter

d2.groupby('g').filter(lambda x : (x['s']*x['test']).max()==2)
Out[390]: 
   g  s  test
0  1  1     1
1  1  2     1
2  1  3     0

Another way, using isin, since filter is usually slower on a big data frame:

s=d2.s.mul(d2.test).groupby(d2['g']).max()==2
d2.loc[d2.g.isin(s.index[s])]
Out[394]: 
   g  s  test
0  1  1     1
1  1  2     1
2  1  3     0
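The intermediate s here is just one boolean per group (a sketch, assuming the same d2 as in the question):

```python
import pandas as pd

d2 = pd.DataFrame({'g': [1, 1, 1, 2, 2, 2],
                   'test': [1, 1, 0, 1, 0, 0],
                   's': [1, 2, 3, 1, 2, 3]})

# One boolean per group label: is the max of s*test equal to 2?
s = d2.s.mul(d2.test).groupby(d2['g']).max() == 2

# s.index[s] is the passing group labels; isin keeps their rows
keep = d2.g.isin(s.index[s])
```

Here s is True for group 1 and False for group 2, so keep marks the first three rows.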

Upvotes: 2
