Chan

Reputation: 4301

How to increase the speed of using fuzzy matching in dataframe?

I want to use fuzzy matching to check whether a dataframe column contains keywords.

However, using apply is very slow.

Are there any faster methods?

Can we use str or re?

import regex

result = df['sentence'].apply(lambda x: regex.compile('(keyword){e<4}').findall(x)) #slow

Thank you very much.

Upvotes: 2

Views: 460

Answers (1)

cs95

Reputation: 402483

Why are you compiling inside the apply? That defeats the purpose of pre-compiling. Also, the best way to speed up an apply call is to not use apply.

Without context on what you're actually trying to match, I suggest:

p = regex.compile('(keyword){e<4}')
result = [p.findall(x) for x in df['sentence']]

My tests show that a list-comprehension-based regex match outperforms the str methods. That said, take this with a grain of salt, because performance always depends on your data and what you're trying to match.
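As a sanity check, here is a minimal sketch (the column name, the pattern 'keyword', and the sample sentences are made up for illustration) showing that the list comprehension produces the same result as apply while skipping pandas' per-row overhead:

```python
import regex
import pandas as pd

# Compile once, outside any loop or apply.
p = regex.compile('(keyword){e<4}')  # fuzzy match: fewer than 4 edits

df = pd.DataFrame({'sentence': ['a keywrd appears here',
                                'zzz',
                                'keyword twice keyword']})

via_apply = df['sentence'].apply(p.findall).tolist()
via_listcomp = [p.findall(x) for x in df['sentence']]

# Both produce identical matches; the list comprehension is the
# faster path because it avoids pandas' per-element dispatch.
assert via_apply == via_listcomp
```

Timing it with `%timeit` (or `timeit.timeit`) on your own data is the only way to know the real difference for your workload.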

You may want to consider using p.search instead of p.findall if you only need a single match per row; it returns as soon as it finds one, so it can be faster.
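For example, a small sketch (again with a placeholder pattern and made-up sentences) that grabs only the first fuzzy hit per row:

```python
import regex

p = regex.compile('(keyword){e<4}')  # fuzzy match: fewer than 4 edits

sentences = ['this row has a keywrd in it', 'zzz']

first_hit = []
for s in sentences:
    m = p.search(s)  # stops at the first match instead of collecting all
    first_hit.append(m.group() if m else None)
```

Here first_hit holds the first fuzzy match for rows that have one, and None for rows that don't.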

Upvotes: 2
