Subset dataframe according to group must contain

Question

I am trying to subset a dataframe based on wether or not groupings contain at least one occurence of certain codes.

Consider this example

import pandas as pd

df = pd.DataFrame({'cId' : [1, 1, 1, 2, 2, 4, 4, 4, 4, 4],
                   'eId' : [1, 1, 1, 1, 1, 1, 1, 2, 2, 3],
                   'code' :['af', 'af', 'la', 'su', 'su', 'af', 'da', 'da', 'la', 'su'],
                   'data' : [1, 2, 3, 5, 3, 5, 2, 5, 2, 1]},
columns=['cId', 'eId', 'code', 'data'])

df
Out[10]: 
   cId  eId code  data
0    1    1   af     1
1    1    1   af     2
2    1    1   la     3
3    2    1   su     5
4    2    1   su     3
5    4    1   af     5
6    4    1   da     2
7    4    2   da     5
8    4    2   la     2
9    4    3   su     1

Entries are to be grouped by cId and eId.

I would like to keep only those groups which have at least one occurence where code in ['af', 'da'].

The end result should be

df
Out[10]: 
   cId  eId code  data
0    1    1   af     1
1    1    1   af     2
2    1    1   la     3
5    4    1   af     5
6    4    1   da     2
7    4    2   da     5
8    4    2   la     2

Any suggestions?

jezrael · Accepted Answer

Use isin for columns, duplicated and last merge:

df = df.loc[df['code'].isin(['af', 'da']), ['cId','eId']].drop_duplicates().merge(df)
print (df)
   cId  eId code  data
0    1    1   af     1
1    1    1   af     2
2    1    1   la     3
3    4    1   af     5
4    4    1   da     2
5    4    2   da     5
6    4    2   la     2

Detail:

print (df.loc[df['code'].isin(['af', 'da']), ['cId','eId']].drop_duplicates())
   cId  eId
0    1    1
5    4    1
7    4    2

Timings:

np.random.seed(45)
N = 100000
df = pd.DataFrame({'cId': np.random.randint(100, size=N),
                   'eId' :np.random.randint(100, size=N),
                   'code': np.random.choice(['af','la','su','da','na'], size=N, p=(0.001,0.2,0.2,0.001,0.598)),
                   'data' :np.random.randint(10, size=N), })


In [68]: %timeit df.loc[df['code'].isin(['af', 'da']), ['cId','eId']].drop_duplicates().merge(df)
100 loops, best of 3: 15.9 ms per loop

In [69]: %timeit df.groupby(['cId', 'eId']).filter(lambda x: x['code'].isin(['af', 'da']).any())
1 loop, best of 3: 4.01 s per loop

In [70]: %timeit df[df.groupby(['cId', 'eId'])['code'].transform(lambda x: x.isin(['af', 'da']).any())]
1 loop, best of 3: 4.05 s per loop

Subset dataframe according to group must contain

Answers (2)

Related Questions