Reputation: 1

Multiple substring search and all should be present in dataframe

I have a list of strings , that I want to search in a particular column in a dataframe. I want to output the rows which has ALL of those values.

I have used the code below:

kw = ['apple','banana','kiwi']
regex_val = "|".join(kw)

df['col'].str.contains(regex_val)

This gives the output of any of the keywords is present. I want to output when ALL are present for a record.

Also the keywords list is not limited to 3. It can vary based on user input.

I have tried the str.contains , but it works if any of the keywords are present.

Upvotes: 0

Answers (1)

mozway

Reputation: 262204

Depending on how you want the words to be identified I see a few options. You could craft a regex with lookaheads, you could extractall words and compute a set, or you could test the words one by one and combine the boolean outputs with numpy.logical_and+reduce:

# option 1: using lookaheads
pattern = ''.join(fr'(?=.*{w})' for w in kw)

df['regex_lookahead'] = df['col'].str.contains(pattern)

# option 2: using extract + groupby
pattern = '|'.join(map(re.escape, kw))
df['regex_groupby'] = (df['col'].str.extractall(f'({pattern})')[0]
                       .groupby(level=0).agg(set).eq(set(kw))
                      #.reindex(df.index, False) # if you want False on no-match 
                      )

# option 3: testing all words independently
import numpy as np
df['contains_reduce'] = np.logical_and.reduce([df['col'].str.contains(w)
                                               for w in kw])

Output:

                 col  regex_lookahead regex_groupby  contains_reduce
0            nothing            False           NaN            False
1             banana            False         False            False
2  kiwi banana apple             True          True             True
3    bananapple kiwi             True         False             True

If you want the first option to only match full words, replace fr'(?=.*{w})' by fr'(?=.*\b{w}\b)'.

                 col  regex_lookahead_words
0            nothing                  False
1             banana                  False
2  kiwi banana apple                   True
3    bananapple kiwi                  False

If you want to match words, you could also avoid any regex, split the string into words and use a set:

df['split_set'] = df['col'].str.split().transform(set).ge(set(kw))

Output:

                 col  split_set
0            nothing      False
1             banana      False
2  kiwi banana apple       True
3    bananapple kiwi      False

Upvotes: 0

Multiple substring search and all should be present in dataframe

Answers (1)

Related Questions