Pandas: Speeding up many string searches

Question

I have a series where each element is an empty list:

matches = pd.Series([[]]*4)

and another series of strings:

strs = pd.Series(["word3, xx word1 word1", "yy", "word2. o", "awldkj"])

I want to populate matches with case-insensitive keyword matches from a set of keywords:

terms = ["word1", "Word2", "worD3"]

Currently, I iterate through each search term individually:

    for term in terms:
        term_re = rf'\b{term}\b'
        has_term = strs.str.contains(term_re, case=False)
        print(has_term.sum(), "matches for", term)
        w_term = has_term.map({True: [term], False: []})
        matches = matches.combine(w_term, lambda li, li2: li + li2)

which yields the correct result:

1 matches for word1
1 matches for Word2
1 matches for worD3

In [507]: matches
Out[509]: 
0    [word1, worD3]
1                []
2           [Word2]
3                []

Two aspects to notice:

1. The order of matching terms in matches does not matter.
2. word1 appears twice in strs.iloc[0] but yields only one match. Two matches would also be fine, since the list can be converted to a set and then back to a list.

But this is much too slow, since my real terms list and strs series are both much larger. Is there any way to speed it up?

anky · Accepted Answer

You can try:

strs.str.findall('(?i){}'.format('|'.join([rf'\b{i}\b' for i in terms]))).map(set)

0    {word1, word3}
1                {}
2           {word2}
3                {}
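
The speedup comes from scanning each string once with a single alternation instead of once per term. As a self-contained sketch of the same idea (not a benchmarked implementation), the version below builds the pattern once and passes the case flag through the flags argument. Using re.escape is an addition that the one-liner above does not have; it assumes the real terms may contain regex metacharacters and should be matched literally.

```python
import re

import pandas as pd

strs = pd.Series(["word3, xx word1 word1", "yy", "word2. o", "awldkj"])
terms = ["word1", "Word2", "worD3"]

# One alternation covering every term. re.escape is an added safeguard
# (an assumption about the real data) so terms are matched literally.
pattern = r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b"

# findall returns the text as it appears in each string, so the results
# carry the casing of strs, not the casing of terms.
found = strs.str.findall(pattern, flags=re.IGNORECASE).map(set)
```

Note that the matched text keeps the spelling found in strs ("word3"), not the spelling used in terms ("worD3").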

Or for preserving order:

(strs.str.findall('(?i){}'.format('|'.join([rf'\b{i}\b' for i in terms])))
     .map(lambda x: list(dict.fromkeys(x))))

0    [word3, word1]
1                []
2           [word2]
3                []
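
The results above contain the text as it appears in strs, while the loop in the question stored the terms as spelled in terms (for example worD3). If that spelling matters, one possible sketch is to map each hit back through a lowercase lookup table. The names canonical and match_terms are hypothetical helpers introduced here for illustration, and the lookup assumes plain ASCII terms whose lowercase forms are unique.

```python
import re

import pandas as pd

strs = pd.Series(["word3, xx word1 word1", "yy", "word2. o", "awldkj"])
terms = ["word1", "Word2", "worD3"]

# Hypothetical lookup: lowercase form -> spelling used in terms.
canonical = {t.lower(): t for t in terms}

# Compile once; the non-capturing group keeps findall returning full matches.
pattern = re.compile(
    r"\b(?:" + "|".join(map(re.escape, terms)) + r")\b",
    flags=re.IGNORECASE,
)

def match_terms(text):
    """Return the matched terms in their original spelling, deduplicated in order."""
    hits = (canonical.get(m.lower(), m) for m in pattern.findall(text))
    return list(dict.fromkeys(hits))

matches = strs.map(match_terms)
```

For the sample data, the first row becomes ['worD3', 'word1'] and the third becomes ['Word2'], which matches the question's output up to ordering.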
