Reputation: 8628
This question is related to my previous question. Given the following dataframe:
df =
ID TYPE VD_0 VD_1 VD_2 VD_3 VD_4 VD_5
1 ABC V1234 aaa bbb 456 123 564
2 DBC 456 A45 aaa V1234 bbb 564
3 ABD 456 V1234 bbb ccc 456 123
4 ABD ccc aaa 123 V1234 SSW 123
There is the following list of target values of VD_0
-VD_5
:
myList = [V1234,456,A45]
I want to get only those rows in df
that have 2 or more "sequencial" occurances of values from myList
in columns VD_0
-VD_5
, BUT it is allowed to have any other values between them (any other that do not belong to myList
). For example, these allowed values might be aaa
, bbb
, ccc
, etc.
The result should be this one:
result =
ID TYPE Col_0 Col_1 Col_2
1 ABC V1234 456
2 DBC 456 A45 V1234
3 ABD 456 V1234 456
In the result
I want to display only values from myList
in columns Col_
, while ignoring the rest of values.
The following code works fine, but it does not take into account that it is allowed to have in between any values that do not appear in myList
:
subset = df.filter(like='VD_')
df[subset.isin(myList).rolling(2, axis=1).sum().max(axis=1)>=2]
Any help will be appreciated.
Upvotes: 1
Views: 542
Reputation: 862611
I think you need check lengths of True
values in columns and rows of DataFrame
by sum
of boolean DataFrame
.
Then is problem for selecting columns, so need add missing columns from df.columns
by reindex
with fill_values=True
:
myList = ['V1234','456','A45']
subset = df.filter(like='VD_')
subset1 = subset.isin(myList)
mask1 = subset1.sum(axis=1) >= 2
print (mask1)
0 True
1 True
2 True
3 False
dtype: bool
mask2 = subset1.sum() >= 2
print (mask2)
VD_0 True
VD_1 True
VD_2 False
VD_3 True
VD_4 False
VD_5 False
dtype: bool
print (mask2.reindex(df.columns, fill_value=True))
ID True
TYPE True
VD_0 True
VD_1 True
VD_2 False
VD_3 True
VD_4 False
VD_5 False
dtype: bool
print (df.loc[mask1, mask2.reindex(df.columns, fill_value=True)])
ID TYPE VD_0 VD_1 VD_3
0 1 ABC V1234 aaa 456
1 2 DBC 456 A45 V1234
2 3 ABD 456 V1234 ccc
Upvotes: 2
Reputation: 76917
Here's another way.
In [903]: df.apply(lambda x: [y for y in x if y in myList], axis=1)
Out[903]:
0 [V1234, 456]
1 [456, A45, V1234]
2 [456, V1234, 456]
3 [V1234]
dtype: object
In [904]: s = df.apply(lambda x: [y for y in x if y in myList], axis=1)
In [905]: s[s.apply(len) >= 2]
Out[905]:
0 [V1234, 456]
1 [456, A45, V1234]
2 [456, V1234, 456]
dtype: object
In [906]: s[s.apply(len) >= 2].apply(pd.Series)
Out[906]:
0 1 2
0 V1234 456 NaN
1 456 A45 V1234
2 456 V1234 456
Upvotes: 1