Reputation: 1817
I am looking for an efficient implementation of a string similarity metric function in Python (or a lib that provides Python bindings).
I want to compare strings with an average size of 10 kB, and I can't take any shortcuts like comparing line-by-line; I need to compare the entire thing. I don't really care what exact metric is used, as long as the results are reasonable and computation is fast. Here's what I've tried so far:

- difflib.SequenceMatcher from the standard lib: ratio() gives good results, but takes >100 ms for 10 kB of text. quick_ratio() takes only half the time, but its results are sometimes far off the real value.
- python-Levenshtein: Levenshtein distance is an acceptable metric for my use case, but Levenshtein.ratio('foo', 'bar') is not faster than SequenceMatcher.

Before I start benchmarking every lib on PyPI that provides string similarity functions, maybe you can point me in the right direction? I'd love to reduce the time for a single comparison to less than 10 ms (on commodity hardware), if possible.
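For reference, a minimal sketch of the comparison described above, assuming a synthetic ~10 kB input (the strings and timings here are my own stand-ins; results will vary with hardware and content):

```python
import difflib
import timeit
import Levenshtein  # pip install python-Levenshtein

a = "lorem ipsum dolor sit amet " * 380       # roughly 10 kB of text
b = a[:5000] + "some edits here" + a[5015:]   # a slightly perturbed copy

sm = difflib.SequenceMatcher(None, a, b)
print(timeit.timeit(sm.ratio, number=1))        # full diff-based ratio, slow
print(timeit.timeit(sm.quick_ratio, number=1))  # cheaper upper bound, coarse
print(timeit.timeit(lambda: Levenshtein.ratio(a, b), number=1))
```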
Upvotes: 4
Views: 6638
Reputation: 1
Based on a lot of reading up, something like tfidf_matcher worked well for me. It returns the best k matches, and it's easily 1000x faster than FuzzyWuzzy.
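For context, the idea behind this approach can be sketched with scikit-learn directly. This is a hedged illustration of TF-IDF over character n-grams plus cosine similarity, not tfidf_matcher's actual API (all names and parameters below are my own):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Character n-grams turn TF-IDF into a string-similarity measure
vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(3, 3))

docs = ["the quick brown fox", "the quick brown fax", "lorem ipsum dolor"]
matrix = vectorizer.fit_transform(docs)

# One sparse matrix product compares docs[0] against every document at once,
# which is where the large speedup over pairwise fuzzy matching comes from
sims = cosine_similarity(matrix[0], matrix)
print(sims)  # the near-duplicate scores far above the unrelated string
```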
Upvotes: 0
Reputation: 565
I've had some luck with RapidFuzz. I don't know how it compares to the others, but it was much faster than thefuzz/fuzzywuzzy.
I don't know if it's applicable to your use case, but it's one of the first things you find when you google "fast string similarity python".
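In case it helps, a minimal sketch of the usage: fuzz.ratio returns a normalized similarity in the range 0-100, and passing score_cutoff lets the library stop early on hopeless pairs, which matters for long strings.

```python
from rapidfuzz import fuzz  # pip install rapidfuzz

a = "I am looking for an efficient string similarity metric"
b = "I am looking for an effective string similarity metric"

# Normalized Levenshtein-style similarity in the range 0-100
print(fuzz.ratio(a, b))

# With score_cutoff, the C++ core exits early once the score provably
# can't reach the threshold and returns 0.0 instead
print(fuzz.ratio(a, b, score_cutoff=90))
```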
Upvotes: 1