Quazi Marufur Rahman

Reputation: 2623

python nltk returning odd result for wordnet similarity measure

I am trying to find the similarity between two words using WordNet in Python's NLTK. The two sample keywords are 'game' and 'leonardo'. First I extract all synsets of these two words, then I cross-match every pair of synsets to find their similarity. Here is my code:

from nltk.corpus import wordnet as wn

xx = wn.synsets("game")
yy = wn.synsets("leonardo")
for x in xx:
    for y in yy:
        print x.name
        print x.definition
        print y.name
        print y.definition
        print x.wup_similarity(y)
        print '\n'

Here is the full output:

game.n.01 a contest with rules to determine a winner leonardo.n.01 Italian painter and sculptor and engineer and scientist and architect; the most versatile genius of the Italian Renaissance (1452-1519) 0.285714285714

game.n.02 a single play of a sport or other contest leonardo.n.01 Italian painter and sculptor and engineer and scientist and architect; the most versatile genius of the Italian Renaissance (1452-1519) 0.285714285714

game.n.03 an amusement or pastime leonardo.n.01 Italian painter and sculptor and engineer and scientist and architect; the most versatile genius of the Italian Renaissance (1452-1519) 0.25

game.n.04 animal hunted for food or sport leonardo.n.01 Italian painter and sculptor and engineer and scientist and architect; the most versatile genius of the Italian Renaissance (1452-1519) 0.923076923077

game.n.05 (tennis) a division of play during which one player serves leonardo.n.01 Italian painter and sculptor and engineer and scientist and architect; the most versatile genius of the Italian Renaissance (1452-1519) 0.222222222222

game.n.06 (games) the score at a particular point or the score needed to win leonardo.n.01 Italian painter and sculptor and engineer and scientist and architect; the most versatile genius of the Italian Renaissance (1452-1519) 0.285714285714

game.n.07 the flesh of wild animals that is used for food leonardo.n.01 Italian painter and sculptor and engineer and scientist and architect; the most versatile genius of the Italian Renaissance (1452-1519) 0.5

plot.n.01 a secret scheme to do something (especially something underhand or illegal) leonardo.n.01 Italian painter and sculptor and engineer and scientist and architect; the most versatile genius of the Italian Renaissance (1452-1519) 0.2

game.n.09 the game equipment needed in order to play a particular game leonardo.n.01 Italian painter and sculptor and engineer and scientist and architect; the most versatile genius of the Italian Renaissance (1452-1519) 0.666666666667

game.n.10 your occupation or line of work leonardo.n.01 Italian painter and sculptor and engineer and scientist and architect; the most versatile genius of the Italian Renaissance (1452-1519) 0.25

game.n.11 frivolous or trifling behavior leonardo.n.01 Italian painter and sculptor and engineer and scientist and architect; the most versatile genius of the Italian Renaissance (1452-1519) 0.222222222222

bet_on.v.01 place a bet on leonardo.n.01 Italian painter and sculptor and engineer and scientist and architect; the most versatile genius of the Italian Renaissance (1452-1519) -1

crippled.s.01 disabled in the feet or legs leonardo.n.01 Italian painter and sculptor and engineer and scientist and architect; the most versatile genius of the Italian Renaissance (1452-1519) -1

game.s.02 willing to face danger leonardo.n.01 Italian painter and sculptor and engineer and scientist and architect; the most versatile genius of the Italian Renaissance (1452-1519) -1

But the similarity between game.n.04 and leonardo.n.01 looks really odd. I don't think the score (0.923076923077) should be this high.

game.n.04

animal hunted for food or sport

leonardo.n.01

Italian painter and sculptor and engineer and scientist and architect; the most versatile genius of the Italian Renaissance (1452-1519)

0.923076923077

Is there a problem with my understanding of this measure?

Upvotes: 5

Views: 3172

Answers (1)

Aya

Reputation: 41950

According to the docs, the wup_similarity() method returns...

...a score denoting how similar two word senses are, based on the depth of the two senses in the taxonomy and that of their Least Common Subsumer (most specific ancestor node).

...and...

>>> from nltk.corpus import wordnet as wn
>>> game = wn.synset('game.n.04')
>>> leonardo = wn.synset('leonardo.n.01')
>>> game.lowest_common_hypernyms(leonardo)
[Synset('organism.n.01')]
>>> organism = game.lowest_common_hypernyms(leonardo)[0]
>>> game.shortest_path_distance(organism)
2
>>> leonardo.shortest_path_distance(organism)
3

...which is why it thinks they're similar, although I get...

>>> game.wup_similarity(leonardo)
0.7058823529411765

...which is different for some reason.
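For what it's worth, the 0.7058823529411765 figure can be reproduced by hand from the numbers in the session above. This is a minimal sketch, assuming the score is computed as 2 * depth(LCS) / (depth(sense1) + depth(sense2)), with each sense's depth measured through the common ancestor. The helper name wu_palmer is made up for illustration, and the depth of 6 for organism.n.01 (counting the root as depth 1) is an assumption rather than something shown in the output above.

```python
def wu_palmer(lcs_depth, dist1, dist2):
    """Wu-Palmer style score from the depth of the common ancestor
    and the distance from each sense down to that ancestor."""
    depth1 = lcs_depth + dist1  # depth of the first sense, via the ancestor
    depth2 = lcs_depth + dist2  # depth of the second sense, via the ancestor
    return 2.0 * lcs_depth / (depth1 + depth2)


# Distances 2 (game.n.04) and 3 (leonardo.n.01) come from the session above.
# The depth of 6 for organism.n.01 is an assumption (root counted as depth 1).
print(wu_palmer(6, 2, 3))  # 12/17
```

Because the common ancestor's depth appears in both the numerator and the denominator, two senses that merely share a deep ancestor such as "organism" score fairly high even when they are unrelated in any everyday sense.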


Update

I want some measurement which will show that dissimilarity('game', 'chess') is much less than dissimilarity('game', 'leonardo')

How about something like this...

from nltk.corpus import wordnet as wn
from itertools import product

def compare(word1, word2):
    ss1 = wn.synsets(word1)
    ss2 = wn.synsets(word2)
    return max(s1.path_similarity(s2) for (s1, s2) in product(ss1, ss2))

for word1, word2 in (('game', 'leonardo'), ('game', 'chess')):
    print "Path similarity of %-10s and %-10s is %.2f" % (word1,
                                                          word2,
                                                          compare(word1, word2))

...which prints...

Path similarity of game       and leonardo   is 0.17
Path similarity of game       and chess      is 0.25
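One caveat: the snippet above is written for Python 2, where max() quietly tolerates a None among the scores. On Python 3, comparing None with a float raises TypeError, and a similarity method may return None for a pair of senses it cannot score (for example, senses with different parts of speech). Here is a rough sketch of the same "best pair" idea with those entries filtered out. The helper name best_score and its similarity callback are hypothetical, not part of NLTK.

```python
from itertools import product


def best_score(senses1, senses2, similarity):
    """Return the highest similarity over all sense pairs,
    or None if no pair produced a score."""
    scores = (similarity(a, b) for a, b in product(senses1, senses2))
    # Drop pairs for which the measure returned None (no score available).
    return max((s for s in scores if s is not None), default=None)
```

With NLTK installed, this would presumably be called as best_score(wn.synsets('game'), wn.synsets('chess'), lambda a, b: a.path_similarity(b)).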

Upvotes: 8
