Reputation: 413
I want to measure the similarity between two words. The idea is to read a text with OCR and check the result for keywords. The function I'm looking for should compare two words and return the similarity in percent, so comparing a word with itself should give 100%. I wrote my own function that compares the words character by character and returns the number of matches relative to the length. The problem is that
wordComp('h0t','hot')
0.66
wordComp('tackoverflow','stackoverflow')
0
Intuitively, though, both examples should have a very high similarity (>90%). Adding the Levenshtein distance
import nltk
nltk.edit_distance('word1','word2')
to my function raises the second result to 92%, but the first result is still not good.
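For reference, here is a minimal sketch of one way an edit distance could be turned into a percentage. It is not the wordComp function from the question (which is not shown). The names `levenshtein` and `levenshtein_similarity` are made up for illustration, and the distance is written out in plain Python so the example does not depend on nltk. It assumes the score is normalized by the length of the longer word.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def levenshtein_similarity(a, b):
    """Hypothetical helper: 1.0 for identical words, lower for more edits."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(a, b) / longest


print(round(levenshtein_similarity('tackoverflow', 'stackoverflow'), 2))  # 0.92
print(round(levenshtein_similarity('h0t', 'hot'), 2))  # 0.67
```

With this normalization a single wrong character in a three-letter word always costs a third of the score, which is why 'h0t' stays low whatever the distance measure.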
I already found this solution for R, and it would be possible to use those functions through rpy2, or to use agrepy as another approach. But I want to be able to make the program more or less sensitive by changing the acceptance threshold (only accept matches with similarity > x%).
Is there another good measure I could use or do you have any ideas to improve my function?
Upvotes: 4
Views: 5302
Reputation: 673
I wrote the following code; try it. I defined str3 for the cases where the two strings being compared (str1 and str2) have different lengths. The code runs in a while loop; enter -1 as the k input to exit.
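k = 1
while k != -1:
    str1 = input()
    str2 = input()
    k = int(input())  # enter -1 to stop after this comparison
    cnt = 0  # reset the match counter for every new pair
    if len(str1) > len(str2):
        # Truncate the longer string to the length of the shorter one.
        str3 = str1[0:len(str2)]
        for j in range(len(str3)):
            if str3[j] == str2[j]:
                cnt += 1
        print(cnt / len(str1) * 100)
    elif len(str1) < len(str2):
        str3 = str2[0:len(str1)]
        # Iterate over the truncated string so the index stays in range.
        for j in range(len(str3)):
            if str3[j] == str1[j]:
                cnt += 1
        print(cnt / len(str2) * 100)
    else:
        for j in range(len(str2)):
            if str2[j] == str1[j]:
                cnt += 1
        print(cnt / len(str1) * 100)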
Upvotes: 0
Reputation: 460
You could just use difflib. This function I got from an answer some time ago has served me well:
from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

print(similar('tackoverflow','stackoverflow'))
print(similar('h0t','hot'))
0.96
0.666666666667
You could easily extend the function, or wrap it in another function, to accept only matches above a given degree of similarity, like so, passing the threshold as a third argument:
from difflib import SequenceMatcher

def similar(a, b, c):
    sim = SequenceMatcher(None, a, b).ratio()
    # Return the ratio only if it exceeds the threshold c; otherwise None.
    if sim > c:
        return sim

print(similar('tackoverflow','stackoverflow', 0.9))
print(similar('h0t','hot', 0.9))
0.96
None
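For the OCR use case in the question, the same ratio is also available through difflib.get_close_matches, which takes a cutoff directly. The following is only a sketch of how that might look. The helper name `find_keywords` and the sample word lists are made up for illustration, and it assumes the OCR output has already been split into words.

```python
from difflib import get_close_matches


def find_keywords(ocr_words, keywords, cutoff=0.9):
    """Map each OCR word to its best-matching keyword, if any reaches cutoff."""
    hits = {}
    for word in ocr_words:
        # n=1 keeps only the single best candidate with ratio >= cutoff.
        match = get_close_matches(word, keywords, n=1, cutoff=cutoff)
        if match:
            hits[word] = match[0]
    return hits


ocr_words = ['tackoverflow', 'h0t', 'banana']  # hypothetical OCR output
keywords = ['stackoverflow', 'hot']            # hypothetical keyword list

print(find_keywords(ocr_words, keywords, cutoff=0.9))
# {'tackoverflow': 'stackoverflow'}
print(find_keywords(ocr_words, keywords, cutoff=0.6))
# {'tackoverflow': 'stackoverflow', 'h0t': 'hot'}
```

Lowering the cutoff makes the check less strict, which is the "more or less sensitive" knob asked for, at the cost of more false matches on short words.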
Upvotes: 9