How to extract specific sequences from a dataframe?

Question

This question is related to my previous question. Given the following dataframe:

df = 
    ID   TYPE   VD_0   VD_1   VD_2   VD_3   VD_4   VD_5
    1    ABC    V1234  aaa    bbb    456    123    564
    2    DBC    456    A45    aaa    V1234  bbb    564
    3    ABD    456    V1234  bbb    ccc    456    123
    4    ABD    ccc    aaa    123    V1234  SSW    123

There is the following list of target values of VD_0-VD_5:

myList = [V1234,456,A45]

I want to get only those rows in df that have 2 or more "sequencial" occurances of values from myList in columns VD_0-VD_5, BUT it is allowed to have any other values between them (any other that do not belong to myList). For example, these allowed values might be aaa, bbb, ccc, etc.

The result should be this one:

result = 
    ID   TYPE   Col_0   Col_1   Col_2 
    1    ABC    V1234   456
    2    DBC    456     A45     V1234
    3    ABD    456     V1234   456

In the result I want to display only values from myList in columns Col_, while ignoring the rest of values.

The following code works fine, but it does not take into account that it is allowed to have in between any values that do not appear in myList:

subset = df.filter(like='VD_')

df[subset.isin(myList).rolling(2, axis=1).sum().max(axis=1)>=2]

Any help will be appreciated.

jezrael · Accepted Answer

I think you need check lengths of True values in columns and rows of DataFrame by sum of boolean DataFrame.

Then is problem for selecting columns, so need add missing columns from df.columns by reindex with fill_values=True:

myList = ['V1234','456','A45']
subset = df.filter(like='VD_')

subset1 = subset.isin(myList)
mask1 = subset1.sum(axis=1) >= 2
print (mask1)

0     True
1     True
2     True
3    False
dtype: bool

mask2 = subset1.sum() >= 2
print (mask2)
VD_0     True
VD_1     True
VD_2    False
VD_3     True
VD_4    False
VD_5    False
dtype: bool

print (mask2.reindex(df.columns, fill_value=True))
ID       True
TYPE     True
VD_0     True
VD_1     True
VD_2    False
VD_3     True
VD_4    False
VD_5    False
dtype: bool

print (df.loc[mask1, mask2.reindex(df.columns, fill_value=True)])
   ID TYPE   VD_0   VD_1   VD_3
0   1  ABC  V1234    aaa    456
1   2  DBC    456    A45  V1234
2   3  ABD    456  V1234    ccc

How to extract specific sequences from a dataframe?

Answers (2)

Related Questions