Similarity measure for Strings in Python

Question

I want to measure the similarity between two words. The idea is to read a text with OCR and check the result for keywords. The function I'm looking for should compare two words and return the similarity in %, so comparing a word with itself should give 100%. I wrote my own function that compares the words character by character and returns the number of matches relative to the length. But the problem is that

wordComp('h0t','hot')
0.66
wordComp('tackoverflow','stackoverflow')
0

Intuitively, though, both examples should have a very high similarity (>90%). Adding the Levenshtein distance

import nltk
nltk.edit_distance('word1','word2')

in my function will increase the second result up to 92% but the first result is still not good.
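For reference, here is a minimal sketch of how an edit distance could be turned into a percentage. The original wordComp is not shown, so the normalisation by the longer word's length is an assumption, as are the names edit_distance and levenshtein_similarity. The distance is implemented in plain Python here so the snippet has no dependencies; nltk.edit_distance could presumably be swapped in.

```python
def edit_distance(a, b):
    """Levenshtein distance via the classic dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # delete ca
                            curr[j - 1] + 1,       # insert cb
                            prev[j - 1] + cost))   # substitute (or keep)
        prev = curr
    return prev[-1]


def levenshtein_similarity(a, b):
    """Assumed normalisation: 1 - distance / length of the longer word."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1 - edit_distance(a, b) / longest


print(round(levenshtein_similarity('tackoverflow', 'stackoverflow'), 2))  # 0.92
print(round(levenshtein_similarity('h0t', 'hot'), 2))                     # 0.67
```

Under this normalisation a single wrong character in a three-letter word always costs a third of the score, which is why short words such as 'h0t' stay low.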

I already found this solution for "R", and it would be possible to use these functions with rpy2, or to use agrepy as another approach. But I want to be able to make the program more or less sensitive by changing the threshold for acceptance (only accept matches with similarity > x%).

Is there another good measure I could use or do you have any ideas to improve my function?

ragamuffin · Accepted Answer

You could just use difflib. This function I got from an answer some time ago has served me well:

from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

print(similar('tackoverflow','stackoverflow'))
print(similar('h0t','hot'))

0.96
0.666666666667
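To see where these numbers come from: according to the difflib documentation, ratio() is 2*M/T, where M is the number of matched characters and T is the combined length of both strings. The sketch below recomputes that by hand; ratio_by_hand is a hypothetical helper name used only for illustration.

```python
from difflib import SequenceMatcher


def ratio_by_hand(a, b):
    """Recompute SequenceMatcher.ratio() as 2*M/T from the matching blocks."""
    matcher = SequenceMatcher(None, a, b)
    # Each block describes a run of identical characters; the final block
    # is a zero-length sentinel and adds nothing to the sum.
    matched = sum(block.size for block in matcher.get_matching_blocks())
    total = len(a) + len(b)
    return 2.0 * matched / total if total else 1.0


# 12 matched characters out of 12 + 13 = 25 in total
print(ratio_by_hand('tackoverflow', 'stackoverflow'))
# 2 matched characters ('h' and 't') out of 3 + 3 = 6 in total
print(ratio_by_hand('h0t', 'hot'))
```

So for 'h0t' versus 'hot' the score is 4/6, the same two thirds the character-by-character comparison gives.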

You could easily extend the function, or wrap it in another function, to account for different degrees of similarity by passing a threshold as a third argument:

from difflib import SequenceMatcher

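def similar(a, b, c):
    # c is the minimum similarity (between 0 and 1) needed to accept a match
    sim = SequenceMatcher(None, a, b).ratio()
    if sim > c:
        return sim
    # otherwise the function falls through and returns None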

print(similar('tackoverflow','stackoverflow', 0.9))
print(similar('h0t','hot', 0.9))

0.96
None
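For the keyword-checking use case from the question, difflib also ships get_close_matches, which applies the same ratio against a list of candidates with an adjustable cutoff. A minimal sketch, assuming a made-up keyword list (KEYWORDS and find_keyword are hypothetical names, not part of difflib):

```python
from difflib import get_close_matches

# Hypothetical keyword list; replace it with the words you expect in the text.
KEYWORDS = ['stackoverflow', 'hot', 'python']


def find_keyword(word, keywords=KEYWORDS, cutoff=0.9):
    """Return the best-matching keyword at or above the cutoff, or None."""
    matches = get_close_matches(word, keywords, n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(find_keyword('tackoverflow'))     # stackoverflow
print(find_keyword('h0t'))              # None
print(find_keyword('h0t', cutoff=0.6))  # hot
```

Raising or lowering cutoff makes the check stricter or more lenient. Short words may need a lower cutoff than long ones, since one misread character weighs more.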
