Reputation: 2889
I am trying to subset a dataframe based on wether or not groupings contain at least one occurence of certain codes.
Consider this example
import pandas as pd
df = pd.DataFrame({'cId' : [1, 1, 1, 2, 2, 4, 4, 4, 4, 4],
'eId' : [1, 1, 1, 1, 1, 1, 1, 2, 2, 3],
'code' :['af', 'af', 'la', 'su', 'su', 'af', 'da', 'da', 'la', 'su'],
'data' : [1, 2, 3, 5, 3, 5, 2, 5, 2, 1]},
columns=['cId', 'eId', 'code', 'data'])
df
Out[10]:
cId eId code data
0 1 1 af 1
1 1 1 af 2
2 1 1 la 3
3 2 1 su 5
4 2 1 su 3
5 4 1 af 5
6 4 1 da 2
7 4 2 da 5
8 4 2 la 2
9 4 3 su 1
Entries are to be grouped by cId
and eId
.
I would like to keep only those groups which have at least one occurence where code
in ['af', 'da']
.
The end result should be
df
Out[10]:
cId eId code data
0 1 1 af 1
1 1 1 af 2
2 1 1 la 3
5 4 1 af 5
6 4 1 da 2
7 4 2 da 5
8 4 2 la 2
Any suggestions?
Upvotes: 1
Views: 27
Reputation: 863166
Use isin
for columns, duplicated
and last merge
:
df = df.loc[df['code'].isin(['af', 'da']), ['cId','eId']].drop_duplicates().merge(df)
print (df)
cId eId code data
0 1 1 af 1
1 1 1 af 2
2 1 1 la 3
3 4 1 af 5
4 4 1 da 2
5 4 2 da 5
6 4 2 la 2
Detail:
print (df.loc[df['code'].isin(['af', 'da']), ['cId','eId']].drop_duplicates())
cId eId
0 1 1
5 4 1
7 4 2
Timings:
np.random.seed(45)
N = 100000
df = pd.DataFrame({'cId': np.random.randint(100, size=N),
'eId' :np.random.randint(100, size=N),
'code': np.random.choice(['af','la','su','da','na'], size=N, p=(0.001,0.2,0.2,0.001,0.598)),
'data' :np.random.randint(10, size=N), })
In [68]: %timeit df.loc[df['code'].isin(['af', 'da']), ['cId','eId']].drop_duplicates().merge(df)
100 loops, best of 3: 15.9 ms per loop
In [69]: %timeit df.groupby(['cId', 'eId']).filter(lambda x: x['code'].isin(['af', 'da']).any())
1 loop, best of 3: 4.01 s per loop
In [70]: %timeit df[df.groupby(['cId', 'eId'])['code'].transform(lambda x: x.isin(['af', 'da']).any())]
1 loop, best of 3: 4.05 s per loop
Upvotes: 1
Reputation: 76947
Options 1
Use filter
In [610]: df.groupby(['cId', 'eId']).filter(lambda x: x['code'].isin(['af', 'da']).any())
Out[610]:
cId eId code data
0 1 1 af 1
1 1 1 af 2
2 1 1 la 3
5 4 1 af 5
6 4 1 da 2
7 4 2 da 5
8 4 2 la 2
Options 2
Use transform
In [612]: df[df.groupby(['cId', 'eId'])['code'].transform(lambda x: x.isin(['af', 'da']).any())]
Out[612]:
cId eId code data
0 1 1 af 1
1 1 1 af 2
2 1 1 la 3
5 4 1 af 5
6 4 1 da 2
7 4 2 da 5
8 4 2 la 2
Upvotes: 1