Jaffer Wilson

Reputation: 7273

ngram results are surprising in Python

I was experimenting with Python's ngram library and ran into an issue with string similarity. The ratio it returned was confusing. Here is what I tried:

>>> import ngram
>>> ngram.NGram.compare('alexp','Alex Cho',N=1)*100
30.0
>>> ngram.NGram.compare('alexp','Alex Plutzer',N=1)*100
21.428571428571427
>>> ngram.NGram.compare('alexp','Alex Plutzer'.lower(),N=1)*100
41.66666666666667
>>> ngram.NGram.compare('alexp','Alex Cho'.lower(),N=1)*100
44.44444444444444
>>> ngram.NGram.compare('alexp','AlexCho'.lower(),N=1)*100
50.0
>>> ngram.NGram.compare('alexp','AlexPlutzer'.lower(),N=1)*100
45.45454545454545

The most similar string should be the one containing alexp, i.e. Alex Plutzer, but the higher score is assigned to the other one, Alex Cho.
What can I do to get a more appropriate result, where Alex Plutzer scores higher than its competitor?

Upvotes: 2

Views: 692

Answers (1)

Quickbeam2k1

Reputation: 5437

With a bit of domain knowledge, the fact that you are using 1-grams, and some curve fitting, I claim that the similarity of two strings S and T is computed via

[Image not preserved: formula defining the similarity of S and T in terms of ngrams(S) and ngrams(T)]

where ngrams gives the n-grams of a string, the curly braces denote sets, and the bars/pipes denote the number of elements in a set.
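As a rough check of a formula of this kind, the sketch below computes a Jaccard-style ratio over 1-grams: the number of shared characters divided by the total number of characters minus the shared ones. Counting repeated characters (a multiset rather than a strict set) is an assumption, chosen because it matches the numbers in the question. `unigram_similarity` is a hypothetical helper, not the ngram library's own code.

```python
from collections import Counter


def unigram_similarity(s, t):
    """Jaccard-style ratio over 1-grams (single characters).

    Assumption: repeated characters are counted, so the overlap is a
    multiset intersection rather than a strict set intersection.
    """
    shared = sum((Counter(s) & Counter(t)).values())  # characters in common
    total = len(s) + len(t) - shared                  # size of the "union"
    return shared / total if total else 0.0


for candidate in ("Alex Cho", "Alex Plutzer"):
    # Print the score for the raw and the lowercased candidate, as percentages.
    print(candidate,
          unigram_similarity("alexp", candidate) * 100,
          unigram_similarity("alexp", candidate.lower()) * 100)
```

Under these assumptions, 'alexp' against 'Alex Cho' gives 3/10 and against 'Alex Plutzer' gives 3/14, in line with the 30.0 and 21.43 shown in the question. The longer name scores lower because every unmatched character enlarges the denominator.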

So if this formula holds, the results you obtain are correct with respect to it. The Levenshtein distance might suit your needs better.
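For reference, here is a minimal pure-Python sketch of the Levenshtein (edit) distance, so you can try it on your own strings before pulling in a library. The function name `levenshtein` and the choice to lowercase the candidates are assumptions made for illustration.

```python
def levenshtein(s, t):
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn s into t."""
    previous = list(range(len(t) + 1))  # distances from "" to each prefix of t
    for i, cs in enumerate(s, start=1):
        current = [i]                   # distance from s[:i] to ""
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(
                previous[j] + 1,         # delete cs
                current[j - 1] + 1,      # insert ct
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


for candidate in ("Alex Cho", "Alex Plutzer"):
    print(candidate, levenshtein("alexp", candidate.lower()))
```

On the strings from the question this gives a distance of 4 for 'alex cho' and 7 for 'alex plutzer', so the raw edit distance also favors the shorter name. It is worth checking on your data whether that behaviour is acceptable before adopting it.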

You may want to check the following Stack Overflow thread. You might also check whether nltk provides the similarity scores you need.

Upvotes: 1
