Reputation: 3625
I have a series where each element is an empty list:
matches = pd.Series([[]]*4)
and another series of strings:
strs = pd.Series(["word3, xx word1 word1", "yy", "word2. o", "awldkj"])
I want to populate matches
with case insensitive keyword matches from a set of keywords:
terms = ["word1", "Word2", "worD3"]
Currently, I iterate through each search term individually:
for tcat in terms:
    tcat_re = rf'\b{tcat}\b'
    has_cat = strs.str.contains(tcat_re, case=False)
    print(has_cat.sum(), "matches for", tcat)
    w_cats = has_cat.map({True: [tcat], False: []})
    matches = matches.combine(w_cats, lambda li, li2: li + li2)
which yields the correct solution:
1 matches for word1
1 matches for Word2
1 matches for worD3
In [507]: matches
Out[509]:
0 [word1, worD3]
1 []
2 [Word2]
3 []
Two aspects to notice:
- The order of the terms within each list in matches does not matter.
- word1 appears twice in strs.iloc[0] but only yields 1 match. It's fine if 2 matches are generated, since the list can be mapped to a set and then back to a list.
But this is much too slow, since my real terms list and strs series are much larger. Is there any way to speed it up?
Upvotes: 2
Views: 73
Reputation: 75100
You can try:
strs.str.findall('(?i){}'.format('|'.join([rf'\b{i}\b' for i in terms]))).map(set)
0 {word1, word3}
1 {}
2 {word2}
3 {}
Or for preserving order:
(strs.str.findall('(?i){}'.format('|'.join([rf'\b{i}\b' for i in terms])))
.map(lambda x: [*dict.fromkeys(x).keys()]))
0 [word3, word1]
1 []
2 [word2]
3 []
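Note that findall returns the text as it appears in strs (word3), not the spelling used in terms (worD3). If the original spelling matters, or if the terms may contain regex metacharacters, something like the following might work. This is only a sketch: the canonical dict is a helper introduced here for illustration, and it assumes the terms are unique after lowercasing.

```python
import re
import pandas as pd

strs = pd.Series(["word3, xx word1 word1", "yy", "word2. o", "awldkj"])
terms = ["word1", "Word2", "worD3"]

# One alternation inside a single pair of word boundaries; re.escape
# protects against terms that contain regex metacharacters.
pattern = r"\b(?:{})\b".format("|".join(re.escape(t) for t in terms))

# Hypothetical helper: lowercased term -> spelling used in `terms`.
# Assumes no two terms collide once lowercased.
canonical = {t.lower(): t for t in terms}

# A single pass over the series instead of one pass per term.
found = strs.str.findall(pattern, flags=re.IGNORECASE)

# Map hits back to the original spelling and drop duplicates,
# keeping the order of first appearance.
matches = found.map(
    lambda hits: list(dict.fromkeys(canonical[h.lower()] for h in hits))
)
print(matches.tolist())
```

Passing flags=re.IGNORECASE is equivalent to the inline (?i) above. The lookup via lower() is fine for ASCII terms; for non-ASCII text the case folding done by the regex engine may differ slightly.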
Upvotes: 2