Reputation: 214
I've got a somewhat convoluted process of subsetting a df. For a given sequence or rows, I want to return specific values. Specifically, using the df below, the beginning of a sequence is signified by Start_A
, Start_B
, Start_C
. If X
or Y
is located within a given sequence, I want to return either Up
or Down
for the same sequence. If Up
or Down
is not found within the sequence containing X
or Y
, then return Left
or Right
. If neither Up
or Down
nor X
or Y
is found print an error.
import pandas as pd
df = pd.DataFrame({
'Num' : [1,2,3,4,6,7,9,10,12,13,14,15,17,18,19,21,22,23,25,26,27,28,30,31,32],
'Item' : ['Start_A','AB','CD','Left','Start_C','CD','X','Up','Right','Start_C','EF','AB','Y','AB','Down','Left','Start_B','AB','Y','CD','Left','Start_A','AB','CD','Right'],
})
m1 = df['Item'].isin(['X','Y']).cumsum().gt(0)
m2 = df['Item'].isin(['Up','Down']).iloc[::-1].cumsum().gt(0)
df1 = df[m1 & m2]
original df:
Num Item
0 1 Start_A # No X,Y within sequence. drop all
1 2 AB
2 3 CD
3 4 Left
4 6 Start_C # X and Up within sequence.
5 7 CD
6 9 X
7 10 Up
8 12 Right
9 13 Start_C # Y and Down within sequence.
10 14 EF
11 15 AB
12 17 Y
13 18 AB
14 19 Down
15 21 Left
16 22 Start_B # Y within sequence. No Up/Down. But Left is.
17 23 AB
18 25 Y
19 26 CD
20 27 AB
21 27 Left
22 28 Start_A # No X,Y within sequence. drop all
23 30 AB
24 31 CD
25 32 Right
intended output:
Num Item
6 9 X
7 10 Up
12 17 Y
14 19 Down
18 25 Y
21 27 Left
Upvotes: 4
Views: 210
Reputation: 153520
Here's a way of doing it:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'Num' : [1,2,3,4,6,7,9,10,12,13,14,15,17,18,19,21,22,23,25,26,27,28,30,31,32],
'Item' : ['Start_A','AB','CD','Left','Start_C','CD','X','Up','Right','Start_C','EF','AB','Y','AB','Down','Left','Start_B','AB','Y','CD','Left','Start_A','AB','CD','Right'],
})
grp = df['Item'].str.startswith('Start_').cumsum()
df['X_Y'] = df['Item'].isin(['X', 'Y'])
df['Up_Down'] = df['Item'].isin(['Up', 'Down'])
df['Left_Right'] = df['Item'].isin(['Left', 'right'])
def f(x):
if x['X_Y'].any():
return pd.concat([x[x['X_Y']], x[x['Up_Down']], x[x['Left_Right']]]).head(2)
df.groupby(grp, group_keys=False).apply(f).drop(['X_Y', 'Up_Down', 'Left_Right'], axis=1)
Output:
Num Item
6 9 X
7 10 Up
12 17 Y
14 19 Down
18 25 Y
20 27 Left
Details:
Upvotes: 2
Reputation: 28729
Added comments within the code; hopefully it should be clear how I reasoned your challenge:
# thanks to Scott Boston for a simpler syntax here
(df.assign(counter = df.Item.str.startswith("Start_").cumsum(),
boolean = lambda df: df.groupby('counter').transform(",".join),
#first phase, X or Y should be present
# if absent, nulls will be introduced
boolean_1 = lambda df: df.boolean.str.extract(r"(X|Y)")
)
.dropna()
# next phase, get them in order of Up, Down, Left, Right
# use extract, since it returns the first match
.assign(boolean_2 = lambda df: df.boolean
.str.extract(r"(Up|Down|Left|Right)"))
# filter and keep the original columns
.query("Item == boolean_1 or Item == boolean_2")
.filter(['Num', 'Item'])
)
Num Item
6 9 X
7 10 Up
12 17 Y
14 19 Down
18 25 Y
20 27 Left
Upvotes: 2
Reputation: 26676
m=df['Item'].str.contains('Start').cumsum()#Create mask
g=df[df.groupby(m)['Item'].transform(lambda x:x.str.contains('X|Y').any())]#new dataframe where each occurance must have X or Y
t=g['Item'].where(g['Item'].str.contains('X|Y|Left|Right',np.nan))#replace any values not included
s=g['Item'].where(g['Item'].str.contains('X|Y|Up|Down',np.nan))#replace any values not included
g=g.assign(drop=np.where(g.groupby(m)['Item'].transform(lambda x:x.isin(['Up','Down']).any()),s,t)).dropna()#Use np.where to enforce contions s and t
print(g)
Num Item drop
6 9 X X
7 10 Up Up
12 17 Y Y
14 19 Down Down
18 25 Y Y
20 27 Left Left
Upvotes: 2