klamann

Reputation: 1817

Most efficient string similarity metric function

I am looking for an efficient implementation of a string similarity metric function in Python (or a lib that provides Python bindings).

I want to compare strings with an average size of 10kb, and I can't take any shortcuts like comparing line by line; I need to compare the entire thing. I don't really care what exact metric is used, as long as the results are reasonable and computation is fast. Here's what I've tried so far:

Before I start benchmarking every lib on pypi that provides functions for measuring string similarity, maybe you can point me in the right direction? I'd love to reduce the time for a single comparison to less than 10ms (on commodity hardware), if possible.

Upvotes: 4

Views: 6638

Answers (3)

Upasana Parashar

Reputation: 1

Based on a lot of reading up, something like tfidf_matcher worked well for me. It returns the best k matches, and it's easily 1000x faster than Fuzzywuzzy.
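
In case it helps, here's a quick sketch of the underlying technique: as I understand it, tfidf_matcher boils down to TF-IDF over character n-grams plus a nearest-neighbour lookup. This version is written directly against scikit-learn rather than tfidf_matcher's own API, and the trigram length and cosine metric are assumptions on my part:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.neighbors import NearestNeighbors

    def top_k_matches(queries, candidates, k=3):
        """Return the k most similar candidates for each query string."""
        # Vectorize everything as TF-IDF over character trigrams
        # (assumed setting; tfidf_matcher may use a different n-gram length).
        vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 3))
        candidate_vectors = vectorizer.fit_transform(candidates)
        query_vectors = vectorizer.transform(queries)
        # Rank candidates by cosine distance; report similarity = 1 - distance.
        nn = NearestNeighbors(n_neighbors=k, metric="cosine").fit(candidate_vectors)
        distances, indices = nn.kneighbors(query_vectors)
        return [[(candidates[j], 1 - d) for j, d in zip(idx, dist)]
                for idx, dist in zip(indices, distances)]

    print(top_k_matches(["string similarty"], ["string similarity", "cosine metric"], k=1))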

Upvotes: 0

Danferno

Reputation: 565

I've had some luck with RapidFuzz. I don't know how it compares to the others, but it was much faster than thefuzz/fuzzywuzzy.

I don't know if it's applicable to your use case, but it's one of the first things you find when you google "fast string similarity python".
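
For reference, a minimal sketch of using RapidFuzz on whole documents; fuzz.ratio and the Levenshtein module are part of RapidFuzz's public API, though the sample strings here are made up:

    from rapidfuzz import fuzz
    from rapidfuzz.distance import Levenshtein

    # Stand-ins for the ~10kb documents from the question.
    a = "some document text " * 500
    b = "some documnet text " * 500

    # Normalized similarity score in the range 0-100.
    print(fuzz.ratio(a, b))

    # Plain edit distance between the full strings.
    print(Levenshtein.distance(a, b))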

Upvotes: 1

klamann

Reputation: 1817

edlib seems to be fast enough for my use case.

It's a C++ lib with Python bindings that calculates the Levenshtein distance for texts <100kb in less than 10ms each (on my machine). 10kb texts are done in ~1ms, which is 100x faster than difflib.SequenceMatcher.
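
In case anyone wants to try it, a minimal sketch of a comparison call; edlib.align and its "editDistance" result field are the library's actual API, but normalizing by the longer string's length to get a similarity score is just my own convention:

    import edlib

    def similarity(a: str, b: str) -> float:
        """Normalized similarity in [0, 1] derived from edlib's edit distance."""
        result = edlib.align(a, b)  # global (NW) alignment by default
        return 1.0 - result["editDistance"] / max(len(a), len(b))

    print(similarity("kitten", "sitting"))  # 3 edits over 7 chars -> ~0.57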

Upvotes: 6
