Reputation: 49
I am trying to find similar sentences in my data. My code uses BM25 and outputs a ranking (Rank 1, 2, 3, ...), where Rank 1 is the most similar sentence. For example, for Sentence 1: "The person is wearing a red-shirt":
Rank 1 : "the boy is wearing a red shirt"
Rank 2 : "the boy is wearing a shirt"
Rank 3 : "the girl is wearing a dress"
I would also like to get a similarity score that shows how similar the sentences are. Any help would be appreciated!
Upvotes: 1
Views: 1655
Reputation: 3730
You can use SequenceMatcher from difflib:
from difflib import SequenceMatcher
s = SequenceMatcher(None, "the boy is wearing a red shirt", "the boy is wearing a shirt")
print(s.ratio())
Output
0.9285714285714286 # 1 being max
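To get a score for every ranked sentence, the same ratio can be computed against each candidate and sorted. This is only a minimal sketch: the helper name rank_by_similarity and the lowercasing step are my own assumptions, not part of difflib, and the score is character-based, so it is not a BM25 score.

```python
from difflib import SequenceMatcher


def rank_by_similarity(query, candidates):
    """Return (score, sentence) pairs, most similar first.

    Hypothetical helper: lowercasing is an assumption so that
    "The" and "the" count as equal.
    """
    scored = [
        (SequenceMatcher(None, query.lower(), c.lower()).ratio(), c)
        for c in candidates
    ]
    # Sort on the score only, highest first.
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


query = "The person is wearing a red-shirt"
candidates = [
    "the girl is wearing a dress",
    "the boy is wearing a shirt",
    "the boy is wearing a red shirt",
]

ranked = rank_by_similarity(query, candidates)
for rank, (score, sentence) in enumerate(ranked, start=1):
    print(f"Rank {rank}: {sentence!r} (score {score:.3f})")
```

Each score is between 0 and 1, so you can show it next to the rank or apply a cutoff below which a sentence is not treated as similar.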
Or
You can use the thefuzz library:
from thefuzz import fuzz
fuzz.ratio("the boy is wearing a red shirt", "the boy is wearing a shirt") # 100 being max
Or
You can use the jellyfish library:
import jellyfish
jellyfish.levenshtein_distance(u'jellyfish', u'smellyfish') # 2
jellyfish.jaro_similarity(u'jellyfish', u'smellyfish') # 0.89629629629629 (called jaro_distance in older jellyfish releases)
jellyfish.damerau_levenshtein_distance(u'jellyfish', u'jellyfihs') # 1
You can find most of the text similarity methods and how they are calculated under this link: https://github.com/luozhouyang/python-string-similarity#python-string-similarity
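If you want to see how one of these measures is calculated, here is a rough pure-Python sketch of Levenshtein distance using dynamic programming. The function names are my own, and dividing by the longer string's length to get a 0 to 1 score is just one common convention that I am assuming here, so libraries may normalize differently.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions
    and substitutions needed to turn a into b."""
    # previous[j] holds the distance between the prefix of a seen so far and b[:j].
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


def normalized_similarity(a, b):
    """Assumed normalization: 1.0 means identical, 0.0 means nothing in common."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


print(levenshtein("jellyfish", "smellyfish"))
print(normalized_similarity("the boy is wearing a red shirt",
                            "the boy is wearing a shirt"))
```

The first call gives the same distance as the jellyfish example above. The second turns the distance into a similarity score that can be compared across sentence pairs of different lengths.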
Upvotes: 4