Reputation: 159
For this example I have records for testing at different site locations.
import pandas as pd

df = pd.DataFrame({'test_id': [1, 2, 3, 4, 5, 6, 7, 8, 9],
                   'name': ['site1', 'site2', 'site3', 'site4', 'site1', 'site2', 'site3', 'site4', 'site5'],
                   'value': ['elevated', 'ok', 'high', 'ok', 'elevated', 'elevated', 'elevated', 'elevated', 'ok']})
   test_id   name     value
0        1  site1  elevated
1        2  site2        ok
2        3  site3      high
3        4  site4        ok
4        5  site1  elevated
5        6  site2  elevated
6        7  site3  elevated
7        8  site4  elevated
8        9  site5        ok
I need to flag rows where sites had 2 or more failed (non-ok) tests.
I figured this needed grouping and counting, and came up with the following:
df['flag'] = df.isin(df.loc[df['value'] != 'ok'].groupby(['name']).filter(lambda x: len(x) > 1))['value']
   test_id   name     value   flag
0        1  site1  elevated   True
1        2  site2        ok  False
2        3  site3      high   True
3        4  site4        ok  False
4        5  site1  elevated   True
5        6  site2  elevated  False
6        7  site3  elevated   True
7        8  site4  elevated  False
8        9  site5        ok  False
Is there a better way to approach this problem? I'm no expert in Pandas.
Upvotes: 3
Views: 37
Reputation: 42916
Your solution is fine! Although you can speed it up by getting rid of the lambda
function and using only native vectorized methods.
First we get a boolean array marking the rows whose value is not equal to 'ok'.
Then we group by 'name' and get the sum for each group with transform.
Finally we check which rows have a sum > 1 with Series.gt, which stands for
"greater than", i.e. >.
df['flag'] = df['value'].ne('ok').groupby(df['name']).transform('sum').gt(1)
   test_id   name     value   flag
0        1  site1  elevated   True
1        2  site2        ok  False
2        3  site3      high   True
3        4  site4        ok  False
4        5  site1  elevated   True
5        6  site2  elevated  False
6        7  site3  elevated   True
7        8  site4  elevated  False
8        9  site5        ok  False
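If it helps to see what each step produces, here is the same logic split into intermediate variables (just a sketch; the names is_fail and fails_per_site are mine, not part of the one-liner above):

import pandas as pd

df = pd.DataFrame({'test_id': [1, 2, 3, 4, 5, 6, 7, 8, 9],
                   'name': ['site1', 'site2', 'site3', 'site4', 'site1', 'site2', 'site3', 'site4', 'site5'],
                   'value': ['elevated', 'ok', 'high', 'ok', 'elevated', 'elevated', 'elevated', 'elevated', 'ok']})

# Step 1: True where the test result is anything other than 'ok'
is_fail = df['value'].ne('ok')

# Step 2: per-site count of failed tests, broadcast back onto every row
fails_per_site = is_fail.groupby(df['name']).transform('sum')

# Step 3: flag rows whose site has 2 or more failed tests
df['flag'] = fails_per_site.gt(1)

Printing fails_per_site shows 2 for the site1 and site3 rows, 1 for site2 and site4, and 0 for site5, which is why only the site1 and site3 rows end up flagged.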
Upvotes: 3