How to filter rows based on the sequence-related constraint?

Question

I have the following dataframe:

df = 
    ID   TYPE   VD_0   VD_1   VD_2   VD_3
    1    ABC    V1234  456    123    564
    2    DBC    456    A45    123    564
    3    ABD    456    V1234  456    123
    4    ABD    123    V1234  SSW    123

I also have the following list of values that can appear in VD_0, VD_1, VD_2 and VD_3:

myList = ['V1234', '456', 'A45']
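
For reproducibility, the frame and the list above could be built roughly as follows. This is only a sketch: it assumes every VD_ value is stored as a string, which the question does not state explicitly.

```python
import pandas as pd

# Assumption: all VD_* values are strings (the question does not give dtypes).
df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "TYPE": ["ABC", "DBC", "ABD", "ABD"],
    "VD_0": ["V1234", "456", "456", "123"],
    "VD_1": ["456", "A45", "V1234", "V1234"],
    "VD_2": ["123", "123", "456", "SSW"],
    "VD_3": ["564", "564", "123", "123"],
})

# Values to look for, also as strings so they compare equal to the cells.
myList = ["V1234", "456", "A45"]
```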

I want to get only those rows in df that have 2 consecutive occurrences of values from myList in columns VD_0, VD_1, VD_2 and VD_3.

The expected result is:

result = 
    ID   TYPE   VD_0   VD_1   VD_2   VD_3
    1    ABC    V1234  456    123    564
    2    DBC    456    A45    123    564
    3    ABD    456    V1234  456    123

For example, in the row with ID 1 the values of VD_0 and VD_1 are V1234 and 456, respectively, and both of these values belong to myList. The same logic applies to the rows with ID 2 (456, A45) and ID 3 (456, V1234).

How can I do it?

Zeugma · Accepted Answer

I agree with the beginning of MaxU's answer, but the end can be simpler, IIUC. The filter you want should find 2 consecutive matches from your list. Apply isin to the VD_ columns, then sum the boolean result two columns at a time along each row: wherever two adjacent columns both match, that sum is 2. This is a 2-period rolling window sum along axis=1. Then take the max value of each row; the matching rows have a value greater than or equal to 2:

subset = df.filter(like='VD_')

df[subset.isin(myList).rolling(2, axis=1).sum().max(axis=1)>=2]
Out[26]: 
   ID TYPE   VD_0   VD_1 VD_2  VD_3
0   1  ABC  V1234    456  123   564
1   2  DBC    456    A45  123   564
2   3  ABD    456  V1234  456   123
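
To make the window logic concrete, here is a minimal sketch that spells out the intermediate steps on the question's data. It is not the code above but an equivalent formulation: instead of calling rolling with axis=1 (an argument that newer pandas releases deprecate), it compares each column's mask with its right-hand neighbour in NumPy. It assumes the VD_ values are stored as strings, and the names mask, pair_hit and result are illustrative only.

```python
import pandas as pd

# Question's data, with the assumption that all VD_* values are strings.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "TYPE": ["ABC", "DBC", "ABD", "ABD"],
    "VD_0": ["V1234", "456", "456", "123"],
    "VD_1": ["456", "A45", "V1234", "V1234"],
    "VD_2": ["123", "123", "456", "SSW"],
    "VD_3": ["564", "564", "123", "123"],
})
myList = ["V1234", "456", "A45"]

subset = df.filter(like="VD_")

# Step 1: boolean array, True where a cell's value is in myList.
mask = subset.isin(myList).to_numpy()

# Step 2: pair_hit[i, j] is True when columns j and j+1 of row i both match,
# which is the same condition as the 2-wide window sum being equal to 2.
pair_hit = mask[:, :-1] & mask[:, 1:]

# Step 3: keep the rows that contain at least one such adjacent pair.
result = df[pair_hit.any(axis=1)]
print(result)
```

On this data the row with ID 4 is dropped because V1234 in VD_1 is its only match and neither neighbour is in myList.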
