Reputation: 1
I have a list of strings that I want to search for in a particular column of a dataframe. I want to output the rows that contain ALL of those values.
I have used the code below:
kw = ['apple','banana','kiwi']
regex_val = "|".join(kw)
df['col'].str.contains(regex_val)
This gives output when any of the keywords is present. I want output only when ALL of them are present in a record.
Also, the keyword list is not limited to 3; it can vary based on user input.
I have tried str.contains, but it matches if any of the keywords is present.
Upvotes: 0
Views: 35
Reputation: 262204
Depending on how you want the words to be identified, I see a few options. You could craft a regex with lookaheads, you could extractall the words and compute a set, or you could test the words one by one and combine the boolean outputs with numpy.logical_and + reduce:
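For reference, the snippets below use a small example frame reconstructed from the output shown further down (example data only):
import pandas as pd

kw = ['apple', 'banana', 'kiwi']
df = pd.DataFrame({'col': ['nothing', 'banana',
                           'kiwi banana apple', 'bananapple kiwi']})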
# option 1: using lookaheads
pattern = ''.join(fr'(?=.*{w})' for w in kw)
df['regex_lookahead'] = df['col'].str.contains(pattern)
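For the example keywords this builds the pattern below; str.contains performs a regex search, and each lookahead has to find its word somewhere in the string, so the pattern only matches rows containing every keyword:
pattern
# '(?=.*apple)(?=.*banana)(?=.*kiwi)'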
# option 2: using extractall + groupby
import re

# extractall collects every keyword occurrence per row (MultiIndex: row, match),
# then the set of words found in each row is compared to the full keyword set
pattern = '|'.join(map(re.escape, kw))
df['regex_groupby'] = (df['col'].str.extractall(f'({pattern})')[0]
                       .groupby(level=0).agg(set).eq(set(kw))
                       #.reindex(df.index, fill_value=False)  # if you want False instead of NaN on no-match
                      )
# option 3: testing all words independently
import numpy as np
df['contains_reduce'] = np.logical_and.reduce([df['col'].str.contains(w)
                                               for w in kw])
Output:
                 col  regex_lookahead  regex_groupby  contains_reduce
0            nothing            False            NaN            False
1             banana            False          False            False
2  kiwi banana apple             True           True             True
3    bananapple kiwi             True          False             True
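Note that options 1 and 3 treat each keyword as a regular expression. If user-supplied keywords may contain regex metacharacters, escape them the way option 2 already does (a defensive tweak, not needed for the example data):
import re

# option 1 with escaped keywords
pattern = ''.join(fr'(?=.*{re.escape(w)})' for w in kw)
# option 3 with literal (non-regex) matching
df['contains_reduce'] = np.logical_and.reduce([df['col'].str.contains(w, regex=False)
                                               for w in kw])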
If you want the first option to only match full words, replace fr'(?=.*{w})' with fr'(?=.*\b{w}\b)':
                 col  regex_lookahead_words
0            nothing                  False
1             banana                  False
2  kiwi banana apple                   True
3    bananapple kiwi                  False
If you want to match full words, you could also avoid regex entirely: split the string into words and compare sets:
df['split_set'] = df['col'].str.split().transform(set).ge(set(kw))
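Here ge compares sets, and >= on sets is the superset test, so this checks that each row's words include every keyword. For data like the example, an equivalent spelling, if you prefer it, is:
df['split_set'] = df['col'].str.split().map(lambda words: set(kw).issubset(words))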
Output:
                 col  split_set
0            nothing      False
1             banana      False
2  kiwi banana apple       True
3    bananapple kiwi      False
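To get only the matching rows rather than a boolean column, which is what the question ultimately asks for, index the frame with any of these masks:
out = df[df['split_set']]   # keeps only the rows in which every keyword was found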
Upvotes: 0