GZ-

Reputation: 247

Pandas Groupby -- efficient selection/filtering of groups based on multiple conditions?

I am trying to select the rows of a DataFrame whose groups satisfy multiple conditions, but I cannot seem to get to a fast Pandas 'native' one-liner.

Here I generate an example dataframe of 2*n*n rows and 4 columns:

import itertools
import random

import pandas as pd

n = 100
lst = range(0, n)
df = pd.DataFrame(
    {'A': list(itertools.chain.from_iterable(itertools.repeat(x, n*2) for x in lst)), 
     'B': list(itertools.chain.from_iterable(itertools.repeat(x, 1*2) for x in lst)) * n,
     'C': random.choices(list(range(100)), k=2*n*n),
     'D': random.choices(list(range(100)), k=2*n*n)
    })

resulting in dataframes such as:

    A   B   C   D
0   0   0   26  49
1   0   0   29  80
2   0   1   70  92
3   0   1   7   2
4   1   0   90  11
5   1   0   19  4
6   1   1   29  4
7   1   1   31  95

I want to keep only the rows belonging to groups (defined by columns A and B) in which C exceeds 50 in at least one row and D exceeds 50 in at least one row.

A "native" Pandas one-liner would be the following:

df.groupby([df.A, df.B]).filter(lambda x: (x.C > 50).any() & (x.D > 50).any())

which produces

    A   B   C   D
2   0   1   70  92
3   0   1   7   2

This is all fine for small dataframes (say n < 20), but the solution becomes quite slow for large ones (for example, 4.58 s when n = 100).
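A timing like this can be reproduced with the standard timeit module; the sketch below (the wrapper function is only illustrative, not part of the original code) times the filter-based one-liner on the df generated above:

import timeit

def filter_groups(frame):
    # the filter-based one-liner from above, wrapped so it can be timed
    return frame.groupby([frame.A, frame.B]).filter(
        lambda x: (x.C > 50).any() & (x.D > 50).any()
    )

# average over a few runs; expect on the order of seconds for n = 100
print(timeit.timeit(lambda: filter_groups(df), number=3) / 3)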

I have an alternative, step-by-step solution which achieves the same result, but runs much faster (28.1 ms when n = 100):

# flag the threshold conditions per row, then group by (A, B)
df_g = df.assign(key_C=df.C > 50, key_D=df.D > 50).groupby([df.A, df.B])
# broadcast each group's 'any' result back to its rows
df_C_bool = df_g.key_C.transform('any')
df_D_bool = df_g.key_D.transform('any')
# keep rows whose group satisfies both conditions
df[df_C_bool & df_D_bool]

but it is arguably uglier. My questions are:

  1. Is there a better "native" Pandas solution for this task, and
  2. Is there a reason for the sub-optimal performance of my version of the "native" solution?

Bonus question:

In fact I only want to extract the groups themselves, not their data. That is, I only need

    A   B
    0   1

in the above example. Is there a way to do this with Pandas without going through the intermediate step I did above?

Upvotes: 1

Views: 177

Answers (1)

Quang Hoang

Reputation: 150785

This is similar to your second approach, but chained together:

mask = (df[['C','D']].gt(50)         # pass a list, e.g. .gt([50, 60]), if `C` and `D` have different thresholds
         .all(axis=1)                # check for both True on the rows
         .groupby([df['A'],df['B']]) # normal groupby
         .transform('max')           # 'any' instead of 'max' also works
        )

df.loc[mask]
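As a quick check (a sketch that rebuilds the 8-row sample frame from the question), this mask reproduces the output that the filter-based one-liner gave there:

import pandas as pd

# the 8-row example from the question
df = pd.DataFrame({'A': [0, 0, 0, 0, 1, 1, 1, 1],
                   'B': [0, 0, 1, 1, 0, 0, 1, 1],
                   'C': [26, 29, 70, 7, 90, 19, 29, 31],
                   'D': [49, 80, 92, 2, 11, 4, 4, 95]})

mask = (df[['C', 'D']].gt(50)
         .all(axis=1)
         .groupby([df['A'], df['B']])
         .transform('max'))

df.loc[mask]
#    A  B   C   D
# 2  0  1  70  92
# 3  0  1   7   2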

If you don't want the data, you can forgo the transform:

mask = df[['C','D']].min(axis=1).gt(50).groupby([df['A'],df['B']]).any()

mask[mask].index

# out
# MultiIndex([(0, 1)],
#           names=['A', 'B'])
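If a small A/B frame is preferred over a MultiIndex, one option is to convert it with MultiIndex.to_frame:

# turn the index of qualifying groups into a plain DataFrame
mask[mask].index.to_frame(index=False)
#    A  B
# 0  0  1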

Upvotes: 3
