Reputation: 7273
I was experimenting with Python's ngram library and ran into an issue with string similarity. The ratio it returns confused me. Here is what I tried:
>>> ngram.NGram.compare('alexp','Alex Cho',N=1)*100
30.0
>>>
>>> ngram.NGram.compare('alexp','Alex Plutzer',N=1)*100
21.428571428571427
>>> ngram.NGram.compare('alexp','Alex Plutzer'.lower(),N=1)*100
41.66666666666667
>>> ngram.NGram.compare('alexp','Alex Cho'.lower(),N=1)*100
44.44444444444444
>>> ngram.NGram.compare('alexp','AlexCho'.lower(),N=1)*100
50.0
>>> ngram.NGram.compare('alexp','AlexPlutzer'.lower(),N=1)*100
45.45454545454545
The most similar string should be the one that contains alexp, i.e. Alex Plutzer, but the higher score is assigned to the other one, Alex Cho.
What can I do to get a more appropriate result, where Alex Plutzer receives a higher score than its competitor?
Upvotes: 2
Views: 692
Reputation: 5437
With a bit of domain knowledge, the fact that you are using 1-grams, and some curve fitting, I claim that the similarity of two strings S and T is computed via
where ngrams just gives the n-grams of a string, the curly braces denote sets, and the bars/pipes denote the number of elements in that set.
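The description above reads like a Jaccard-style ratio: n-grams shared by both strings divided by all n-grams of both strings. Here is a minimal sketch of that reading, assuming N=1, no padding, and n-grams counted with multiplicity. These assumptions were chosen because they reproduce the six scores in the question; they are not taken from the library's source, and the names `char_ngrams` and `ngram_similarity` are made up for illustration.

```python
from collections import Counter


def char_ngrams(text, n=1):
    """Return the list of character n-grams of text (no padding)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_similarity(s, t, n=1):
    """Shared n-grams divided by total n-grams, counted with multiplicity."""
    s_counts = Counter(char_ngrams(s, n))
    t_counts = Counter(char_ngrams(t, n))
    shared = sum((s_counts & t_counts).values())  # multiset intersection
    total = sum((s_counts | t_counts).values())   # multiset union
    return shared / total if total else 0.0


# Print the scores as percentages, like the session in the question
for candidate in ("Alex Cho", "Alex Plutzer", "alex cho", "alex plutzer"):
    print(candidate, round(ngram_similarity("alexp", candidate) * 100, 2))
```

Under this reading, every extra character in the longer name enlarges the denominator, which is why the shorter "alex cho" can outscore "alex plutzer" even though it shares one character fewer with "alexp".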
So if this formula holds, the results you obtain are correct with respect to it. The Levenshtein distance might suit your needs better.
You may want to check the following Stack Overflow thread. Additionally, you might want to check whether nltk provides the similarity scores you need.
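For reference, the Levenshtein distance mentioned above can be sketched in plain Python. This is the textbook dynamic-programming version, not code from the ngram library or nltk, and the function name `levenshtein` is only illustrative.

```python
def levenshtein(s, t):
    """Count the single-character edits needed to turn s into t."""
    # previous[j] holds the distance between the processed prefix of s and t[:j]
    previous = list(range(len(t) + 1))
    for i, s_char in enumerate(s, start=1):
        current = [i]  # turning s[:i] into the empty string takes i deletions
        for j, t_char in enumerate(t, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # delete s_char
                current[j - 1] + 1,      # insert t_char
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


# Compare the query against the lowercased candidates from the question
for candidate in ("alex cho", "alex plutzer"):
    print(candidate, levenshtein("alexp", candidate))
```

Lower is better here, and the raw distance grows with the difference in length, so it is worth checking the ranking on the actual names before relying on it.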
Upvotes: 1