Reputation: 83
I have the following dataframe:
df = pd.DataFrame({'Subject': [1,1,1,2,2,2,3,3,3,4,4,4,5,5,5,5], 'dx1': ['CN','AD','AD','CN','UNC','AD','AD','CN','CN','CN','AD','CN','CN','CN','AD','AD']})
For each Subject I want to keep the previous or the following row where dx1 == 'CN', when dx1 == 'AD. Note that for subjects 3,4 and 5 there are more than one cases of 'CN'. In these cases I want to keep the closest row to the first appearance of dx1 == 'AD'.
In the following dataframe are marked the rows that I want to keep.
Upvotes: 1
Views: 61
Reputation: 863226
Idea is remove not rows with AD, CN
by filtering in first step and then filter pattern AD, CN
and CN, AD
per groups:
#create deafult index - used for matching final ouput
df = df.reset_index(drop=True)
df1 = df[df['dx1'].isin(['AD','CN'])]
s1 = df1.groupby('Subject')['dx1'].shift(-1)
s2 = df1.groupby('Subject')['dx1'].shift()
cn = df1['dx1'].eq('CN')
ad = df1['dx1'].eq('AD')
m1 = cn & s1.eq('AD')
m2 = ad & s2.eq('CN')
m11 = ad & s1.eq('CN')
m22 = cn & s2.eq('AD')
df2 = df1[m1 | m2 | m11 | m22].copy()
print (df2)
Subject dx1
0 1 CN
1 1 AD
3 2 CN
5 2 AD
6 3 AD
7 3 CN
9 4 CN
10 4 AD
11 4 CN
13 5 CN
14 5 AD
Then filter only CN
values and drop duplicates by both columns for final values:
mask = df2['dx1'].eq('CN')
idx = df2[mask].drop_duplicates(subset=['Subject','dx1']).index
df3 = df.loc[idx]
print (df3)
Subject dx1
0 1 CN
3 2 CN
7 3 CN
9 4 CN
13 5 CN
Or for new column:
df['new'] = df.index.isin(idx)
print (df)
Subject dx1 new
0 1 CN True
1 1 AD False
2 1 AD False
3 2 CN True
4 2 UNC False
5 2 AD False
6 3 AD False
7 3 CN True
8 3 CN False
9 4 CN True
10 4 AD False
11 4 CN False
12 5 CN False
13 5 CN True
14 5 AD False
15 5 AD False
Upvotes: 1