Reputation: 1817
I am looking for an efficient implementation of a string similarity metric function in Python (or a lib that provides Python bindings).
I want to compare strings with an average size of 10 kB, and I can't take any shortcuts like comparing line-by-line; I need to compare the entire thing. I don't really care what exact metric is used, as long as the results are reasonable and computation is fast. Here's what I've tried so far:

- difflib.SequenceMatcher from the standard lib: ratio() gives good results, but takes >100 ms for 10 kB of text. quick_ratio() takes only half the time, but its results are sometimes far off the real value.
- python-Levenshtein: Levenshtein distance is an acceptable metric for my use case, but Levenshtein.ratio('foo', 'bar') is not faster than SequenceMatcher.

Before I start benchmarking every lib on PyPI that provides string similarity functions, maybe you can point me in the right direction? I'd love to reduce the time for a single comparison to less than 10 ms (on commodity hardware), if possible.
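For reference, a minimal sketch of the comparison described above, assuming a synthetic ~10 kB input (the strings and timings here are my own stand-ins; results will vary with hardware and content):

```python
import difflib
import timeit
import Levenshtein  # pip install python-Levenshtein

a = "lorem ipsum dolor sit amet " * 380       # roughly 10 kB of text
b = a[:5000] + "some edits here" + a[5015:]   # a slightly perturbed copy

sm = difflib.SequenceMatcher(None, a, b)
print(timeit.timeit(sm.ratio, number=1))        # full diff-based ratio, slow
print(timeit.timeit(sm.quick_ratio, number=1))  # cheaper upper bound, coarse
print(timeit.timeit(lambda: Levenshtein.ratio(a, b), number=1))
```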
Upvotes: 4
Views: 6638
Reputation: 1
Based on a lot of reading up, something like tfidf_matcher worked well for me. It returns the best k matches, and it's easily 1000x faster than FuzzyWuzzy.
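For context, the idea behind this approach can be sketched with scikit-learn directly. This is a hedged illustration of TF-IDF over character n-grams plus cosine similarity, not tfidf_matcher's actual API (all names and parameters below are my own):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Character n-grams turn TF-IDF into a string-similarity measure
vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(3, 3))

docs = ["the quick brown fox", "the quick brown fax", "lorem ipsum dolor"]
matrix = vectorizer.fit_transform(docs)

# One sparse matrix product compares docs[0] against every document at once,
# which is where the large speedup over pairwise fuzzy matching comes from
sims = cosine_similarity(matrix[0], matrix)
print(sims)  # the near-duplicate scores far above the unrelated string
```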
Upvotes: 0
Reputation: 565
I've had some luck with RapidFuzz. I don't know how it compares to the others, but it was much faster than thefuzz/fuzzywuzzy.
I don't know if it's applicable to your use case, but it's one of the first things you find when you google "fast string similarity python".
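In case it helps, a minimal sketch of the usage: fuzz.ratio returns a normalized similarity in the range 0-100, and passing score_cutoff lets the library stop early on hopeless pairs, which matters for long strings.

```python
from rapidfuzz import fuzz  # pip install rapidfuzz

a = "I am looking for an efficient string similarity metric"
b = "I am looking for an effective string similarity metric"

# Normalized Levenshtein-style similarity in the range 0-100
print(fuzz.ratio(a, b))

# With score_cutoff, the C++ core exits early once the score provably
# can't reach the threshold and returns 0.0 instead
print(fuzz.ratio(a, b, score_cutoff=90))
```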
Upvotes: 1