Reputation: 2623
I am trying to measure the similarity between two words using WordNet in Python's NLTK. The two sample keywords are 'game' and 'leonardo'. First I extract all synsets of both words, then compare every pair of synsets to find their similarity. Here is my code:
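from nltk.corpus import wordnet as wn

xx = wn.synsets("game")
yy = wn.synsets("leonardo")

# Compare every sense of "game" with every sense of "leonardo".
# In NLTK 3, name() and definition() are methods; older releases exposed them as attributes.
for x in xx:
    for y in yy:
        print(x.name())
        print(x.definition())
        print(y.name())
        print(y.definition())
        print(x.wup_similarity(y))
        print('\n')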
Here is the full output:
game.n.01 a contest with rules to determine a winner leonardo.n.01 Italian painter and sculptor and engineer and scientist and architect; the most versatile genius of the Italian Renaissance (1452-1519) 0.285714285714
game.n.02 a single play of a sport or other contest leonardo.n.01 Italian painter and sculptor and engineer and scientist and architect; the most versatile genius of the Italian Renaissance (1452-1519) 0.285714285714
game.n.03 an amusement or pastime leonardo.n.01 Italian painter and sculptor and engineer and scientist and architect; the most versatile genius of the Italian Renaissance (1452-1519) 0.25
game.n.04 animal hunted for food or sport leonardo.n.01 Italian painter and sculptor and engineer and scientist and architect; the most versatile genius of the Italian Renaissance (1452-1519) 0.923076923077
game.n.05 (tennis) a division of play during which one player serves leonardo.n.01 Italian painter and sculptor and engineer and scientist and architect; the most versatile genius of the Italian Renaissance (1452-1519) 0.222222222222
game.n.06 (games) the score at a particular point or the score needed to win leonardo.n.01 Italian painter and sculptor and engineer and scientist and architect; the most versatile genius of the Italian Renaissance (1452-1519) 0.285714285714
game.n.07 the flesh of wild animals that is used for food leonardo.n.01 Italian painter and sculptor and engineer and scientist and architect; the most versatile genius of the Italian Renaissance (1452-1519) 0.5
plot.n.01 a secret scheme to do something (especially something underhand or illegal) leonardo.n.01 Italian painter and sculptor and engineer and scientist and architect; the most versatile genius of the Italian Renaissance (1452-1519) 0.2
game.n.09 the game equipment needed in order to play a particular game leonardo.n.01 Italian painter and sculptor and engineer and scientist and architect; the most versatile genius of the Italian Renaissance (1452-1519) 0.666666666667
game.n.10 your occupation or line of work leonardo.n.01 Italian painter and sculptor and engineer and scientist and architect; the most versatile genius of the Italian Renaissance (1452-1519) 0.25
game.n.11 frivolous or trifling behavior leonardo.n.01 Italian painter and sculptor and engineer and scientist and architect; the most versatile genius of the Italian Renaissance (1452-1519) 0.222222222222
bet_on.v.01 place a bet on leonardo.n.01 Italian painter and sculptor and engineer and scientist and architect; the most versatile genius of the Italian Renaissance (1452-1519) -1
crippled.s.01 disabled in the feet or legs leonardo.n.01 Italian painter and sculptor and engineer and scientist and architect; the most versatile genius of the Italian Renaissance (1452-1519) -1
game.s.02 willing to face danger leonardo.n.01 Italian painter and sculptor and engineer and scientist and architect; the most versatile genius of the Italian Renaissance (1452-1519) -1
But the similarity between game.n.04 and leonardo.n.01 is really odd. I think the similarity (0.923076923077) should not be so high.
game.n.04
animal hunted for food or sport
leonardo.n.01
Italian painter and sculptor and engineer and scientist and architect; the most versatile genius of the Italian Renaissance (1452-1519)
0.923076923077
Is there a flaw in my approach?
Upvotes: 5
Views: 3172
Reputation: 41950
According to the docs, the wup_similarity() method returns...
...a score denoting how similar two word senses are, based on the depth of the two senses in the taxonomy and that of their Least Common Subsumer (most specific ancestor node).
...and...
>>> from nltk.corpus import wordnet as wn
>>> game = wn.synset('game.n.04')
>>> leonardo = wn.synset('leonardo.n.01')
>>> game.lowest_common_hypernyms(leonardo)
[Synset('organism.n.01')]
>>> organism = game.lowest_common_hypernyms(leonardo)[0]
>>> game.shortest_path_distance(organism)
2
>>> leonardo.shortest_path_distance(organism)
3
...which is why it thinks they're similar, although I get...
>>> game.wup_similarity(leonardo)
0.7058823529411765
...which is different for some reason.
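The arithmetic behind the Wu-Palmer score can be sketched from the numbers in the session above. This is a minimal illustration, not NLTK's own code: `wup_from_depths` is a hypothetical helper, and the depth of 6 for organism.n.01 is an assumption (counting the root as depth 1); it can differ by WordNet version and counting convention. With that assumption, the distances 2 and 3 give 12/17, which matches the value I get.

```python
def wup_from_depths(lcs_depth, dist1, dist2):
    """Wu-Palmer score from the LCS depth and each synset's distance to the LCS.

    Hypothetical helper for illustration only; not part of NLTK.
    """
    depth1 = lcs_depth + dist1  # depth of the first synset, measured through the LCS
    depth2 = lcs_depth + dist2  # depth of the second synset, measured through the LCS
    return 2.0 * lcs_depth / (depth1 + depth2)


# dist1=2 and dist2=3 come from shortest_path_distance() above.
# lcs_depth=6 is an ASSUMPTION about organism.n.01, not a value from the session.
score = wup_from_depths(6, 2, 3)
print(score)
```

Because the common ancestor is deep in the taxonomy and both synsets sit only a few steps below it, the ratio stays high even though the two senses are unrelated in everyday terms. The 0.923076923077 in the question equals 12/13, so it may come from a release that computed the depths differently, but that is a guess.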
Update
I want some measurement which will show that dissimilarity('game', 'chess') is much much less than dissimilarity('game', 'leonardo')
How about something like this...
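from itertools import product
from nltk.corpus import wordnet as wn

def compare(word1, word2):
    """Return the best path similarity over all synset pairs of the two words."""
    ss1 = wn.synsets(word1)
    ss2 = wn.synsets(word2)
    # path_similarity() returns None when no path connects two synsets
    # (e.g. a noun and a verb), so count those pairs as 0.
    return max((s1.path_similarity(s2) or 0) for s1, s2 in product(ss1, ss2))

for word1, word2 in (('game', 'leonardo'), ('game', 'chess')):
    print("Path similarity of %-10s and %-10s is %.2f" % (word1,
                                                          word2,
                                                          compare(word1, word2)))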
...which prints...
Path similarity of game       and leonardo   is 0.17
Path similarity of game       and chess      is 0.25
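If it also matters which pair of senses produced the best score, the same max-over-all-pairs idea can return the pair along with the number. The sketch below is a variation on the approach above, not anything NLTK provides: `best_pair` is a hypothetical helper, and the toy labels and scores are made up so the example runs without WordNet data.

```python
from itertools import product


def best_pair(items1, items2, similarity):
    """Return (score, a, b) for the highest-scoring pair, or None if no pair is comparable."""
    best = None
    for a, b in product(items1, items2):
        score = similarity(a, b)
        if score is None:  # pair has no defined similarity; skip it
            continue
        if best is None or score > best[0]:
            best = (score, a, b)
    return best


# Made-up labels and scores, purely for illustration (NOT real WordNet values).
toy_scores = {("a1", "b1"): 0.25, ("a2", "b1"): 0.10}


def toy_similarity(a, b):
    # Returns None for pairs that are missing from the toy table.
    return toy_scores.get((a, b))


result = best_pair(["a1", "a2", "a3"], ["b1"], toy_similarity)
print(result)
```

With NLTK, the call would look like best_pair(wn.synsets(word1), wn.synsets(word2), lambda a, b: a.path_similarity(b)), which skips pairs where path_similarity() returns None instead of counting them as zero.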
Upvotes: 8