Dinosaurius
Dinosaurius

Reputation: 8628

How to filter rows based on the sequence-related constraint?

I have the following dataframe:

df = 
    ID   TYPE   VD_0   VD_1   VD_2   VD_3
    1    ABC    V1234  456    123    564
    2    DBC    456    A45    123    564
    3    ABD    456    V1234  456    123
    4    ABD    123    V1234  SSW    123

There is the following list of values of VD_0, VD_1, VD_2 and VD_3:

myList = [V1234,456,A45]

I want to get only those rows in df that have 2 sequencial occurances of values from myList in columns VD_0, VD_1, VD_2 and VD_3.

The result is this one:

result = 
    ID   TYPE   VD_0   VD_1   VD_2   VD_3
    1    ABC    V1234  456    123    564
    2    DBC    456    A45    123    564
    3    ABD    456    V1234  456    123

For example, in row with ID 1 the values of VD_0 and VD_1 are equal to V1234 and 456, correspondingly, and both of these values belong to myList. The same logic is applied to rows with ID 2 (456,A45) and 3 (456,V1234).

How can I do it?

Upvotes: 4

Views: 84

Answers (2)

Zeugma
Zeugma

Reputation: 32105

I agree with the beginning of MaxU's answer, yet, the end should be easier IIUC. The filter you want should get 2 consecutive matches from your list. You can get this answer by saying you want the row by row sum of isin result being at least a value of 2 if you sum them two by two. This is called a 2-period rolling window sum along axis=1. Then you take the max value of each row and the matches have a value greater or equal then 2:

subset = df.filter(like='VD_')

df[subset.isin(myList).rolling(2, axis=1).sum().max(axis=1)>=2]
Out[26]: 
   ID TYPE   VD_0   VD_1 VD_2  VD_3
0   1  ABC  V1234    456  123   564
1   2  DBC    456    A45  123   564
2   3  ABD    456  V1234  456   123

Upvotes: 2

MaxU - stand with Ukraine
MaxU - stand with Ukraine

Reputation: 210882

Try this:

In [112]: subset = df.filter(like='VD_')

In [113]: df[subset[subset.isin(myList)].stack().duplicated().unstack().any(1)]
Out[113]:
   ID TYPE VD_0   VD_1 VD_2  VD_3
1   2  DBC  456    A45  123   564
2   3  ABD  456  V1234  456   123
3   4  ABD  123  V1234  SSW   123

Explanations:

In [114]: subset
Out[114]:
    VD_0   VD_1 VD_2  VD_3
0  V1234    456  123   564
1    456    A45  123   564
2    456  V1234  456   123
3    123  V1234  SSW   123

In [115]: subset.isin(myList)
Out[115]:
    VD_0  VD_1   VD_2   VD_3
0   True  True  False  False
1   True  True  False  False
2   True  True   True  False
3  False  True  False  False

In [116]: subset[subset.isin(myList)]
Out[116]:
    VD_0   VD_1 VD_2  VD_3
0  V1234    456  NaN   NaN
1    456    A45  NaN   NaN
2    456  V1234  456   NaN
3    NaN  V1234  NaN   NaN

In [118]: subset[subset.isin(myList)].stack()
Out[118]:
0  VD_0    V1234
   VD_1      456
1  VD_0      456
   VD_1      A45
2  VD_0      456
   VD_1    V1234
   VD_2      456
3  VD_1    V1234
dtype: object

In [119]: subset[subset.isin(myList)].stack().duplicated()
Out[119]:
0  VD_0    False
   VD_1    False
1  VD_0     True
   VD_1    False
2  VD_0     True
   VD_1     True
   VD_2     True
3  VD_1     True
dtype: bool

In [120]: subset[subset.isin(myList)].stack().duplicated().unstack()
Out[120]:
    VD_0   VD_1  VD_2
0  False  False  None
1   True  False  None
2   True   True  True
3   None   True  None

In [121]: subset[subset.isin(myList)].stack().duplicated().unstack().any(1)
Out[121]:
0    False
1     True
2     True
3     True
dtype: bool

Upvotes: 1

Related Questions