GG_Python

Reputation: 3541

Compare similarity of terms/expressions using NLTK?

I'm trying to compare terms/expressions that may or may not be semantically related. They are not full sentences, and not necessarily single words. For example:

'Social networking service' and 'Social network' are clearly strongly related, but how do I quantify this using NLTK?

Clearly I'm missing something, as even this code:

w1 = wordnet.synsets('social network')

returns an empty list.

Any advice on how to tackle this?

Upvotes: 6

Views: 14331

Answers (5)

alvas

Reputation: 122260

You may need a word sense disambiguation (WSD) module that returns an NLTK WordNet Synset object. If so, take a look at this: https://github.com/alvations/pywsd

$ wget https://github.com/alvations/pywsd/archive/master.zip
$ unzip master.zip
$ cd pywsd/
$ ls
baseline.py  cosine.py  lesk.py  README.md  similarity.py  test_wsd.py
$ python
>>> from similarity import max_similarity
>>> sent = 'I went to the bank to deposit my money'
>>> sim_choice = "lin" # Using Lin's (1998) similarity measure.
>>> print("Context:", sent)
>>> print("Similarity:", sim_choice)
>>> answer = max_similarity(sent, 'bank', sim_choice)
>>> print("Sense:", answer)
>>> print("Definition", answer.definition())  # definition() is a method in NLTK 3

[out]:

Context: I went to the bank to deposit my money
Similarity: lin
Sense: Synset('bank.n.09')
Definition a building in which the business of banking transacted

Upvotes: 2

eyquem

Reputation: 27585

import difflib

sm = difflib.SequenceMatcher(None)

sm.set_seq2('Social network')
# From the docs: SequenceMatcher computes and caches detailed information
# about the second sequence, so if you want to compare one
# sequence against many sequences, use set_seq2() to set
# the commonly used sequence once and call set_seq1()
# repeatedly, once for each of the other sequences.

for x in ('Social networking service',
          'Social working service',
          'Social ocean',
          'Atlantic ocean',
          'Atlantic and arctic oceans'):
    sm.set_seq1(x)
    # Python 3 prints more digits than the Python 2 output shown below.
    print(x, sm.ratio())

result

Social networking service 0.717948717949
Social working service 0.611111111111
Social ocean 0.615384615385
Atlantic ocean 0.214285714286
Atlantic and arctic oceans 0.15
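
The loop above can also be wrapped into reusable helpers. This is only a minimal sketch built on the same difflib.SequenceMatcher call; the names surface_similarity and rank_by_similarity are hypothetical, not part of difflib. Keep in mind that the ratio measures character overlap, not meaning: in the results above, 'Social ocean' scores far higher than 'Atlantic ocean' against 'Social network' simply because it shares the word 'Social'.

```python
import difflib


def surface_similarity(candidate, target):
    """Return difflib's ratio (0.0 to 1.0) for two strings.

    This is a character-level measure, not a semantic one.
    """
    return difflib.SequenceMatcher(None, candidate, target).ratio()


def rank_by_similarity(target, candidates):
    """Return the candidates sorted from most to least similar to target."""
    return sorted(candidates,
                  key=lambda c: surface_similarity(c, target),
                  reverse=True)


if __name__ == '__main__':
    phrases = ['Atlantic ocean',
               'Social ocean',
               'Social networking service']
    for phrase in rank_by_similarity('Social network', phrases):
        print(phrase, surface_similarity(phrase, 'Social network'))
```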

Upvotes: 1

Leo

Reputation: 11

https://www.mashape.com/amtera/esa-semantic-relatedness

This is a web API that calculates semantic relatedness between a pair of words or text excerpts.

Upvotes: 1

Somum

Reputation: 2422

Here is a solution you can use.

     w1 = wordnet.synsets('social')
     w2 = wordnet.synsets('network')

w1 and w2 will each hold a list of synsets. Compute the similarity between every synset in w1 and every synset in w2. The pair with the maximum similarity gives you the combined sense (which is what you are looking for).

Here is the full code

from nltk.corpus import wordnet
x = 'social'
y = 'network'
xsyn = wordnet.synsets(x)
# xsyn
#[Synset('sociable.n.01'), Synset('social.a.01'), Synset('social.a.02'),   
#Synset('social.a.03'), Synset('social.s.04'), Synset('social.s.05'),   
#Synset('social.s.06')]

ysyn = wordnet.synsets(y)
#ysyn
#[Synset('network.n.01'), Synset('network.n.02'), Synset('net.n.06'), 
#Synset('network.n.04'), Synset('network.n.05'), Synset('network.v.01')]

xlen = len(xsyn)
ylen = len(ysyn)

import numpy
simindex = numpy.zeros( (xlen,ylen) )

def relative_matrix(asyn, bsyn, simindex):
    # Fill simindex[i, j] with the Wu-Palmer similarity of asyn[i] and bsyn[j].
    for i, a in enumerate(asyn):
        for j, b in enumerate(bsyn):
            # Compare only matching parts of speech (noun-noun, verb-verb),
            # not noun-verb or adjective-noun. pos() is a method in NLTK 3.
            if a.pos() != b.pos():
                continue
            score = a.wup_similarity(b)
            # wup_similarity can return None when no path connects the synsets.
            if score is not None and score > simindex[i, j]:
                simindex[i, j] = score

relative_matrix(xsyn, ysyn, simindex)
print(simindex)
'''
array([[ 0.46153846,  0.125     ,  0.13333333,  0.125     ,  0.125     ,
     0.        ],
   [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
     0.        ],
   [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
     0.        ],
   [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
     0.        ],
   [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
     0.        ],
   [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
     0.        ],
   [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
     0.        ]])
'''
# xsyn[0].definition()
# 'a party of people assembled to promote sociability and communal activity'
# ysyn[0].definition()
# 'an interconnected system of things or people'

As you can see, simindex[0,0] holds the maximum value, 0.46153846, so xsyn[0] and ysyn[0] seem to best describe 'social network', the phrase for which wordnet.synsets('social network') returned nothing. You can check this against their definitions.
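
If you only need the best pair rather than the whole matrix, the search can be sketched without numpy. This is a minimal sketch, not the code used above: best_pair is a hypothetical helper that accepts any two-argument scoring function, so the WordNet call appears only in the commented usage, which assumes NLTK 3 with the WordNet corpus downloaded.

```python
from itertools import product


def best_pair(a_items, b_items, similarity):
    """Return (score, a, b) for the highest-scoring pair.

    similarity(a, b) may return None to mean "not comparable";
    such pairs are skipped. Returns None if no pair has a score.
    """
    best = None
    for a, b in product(a_items, b_items):
        score = similarity(a, b)
        if score is None:
            continue
        if best is None or score > best[0]:
            best = (score, a, b)
    return best


# Usage with WordNet (assumes NLTK 3 and the 'wordnet' corpus are installed):
#
#   from nltk.corpus import wordnet
#
#   def same_pos_wup(a, b):
#       # Compare only synsets that share a part of speech.
#       return a.wup_similarity(b) if a.pos() == b.pos() else None
#
#   print(best_pair(wordnet.synsets('social'),
#                   wordnet.synsets('network'),
#                   same_pos_wup))
```

Returning None for pairs that cannot be compared keeps the part-of-speech filter outside the search loop, so a different WordNet measure can be swapped in without touching best_pair.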

Upvotes: 2

arturomp

Reputation: 29630

There are some measures of semantic relatedness or similarity, but they're better defined for single words or single expressions in wordnet's lexicon - not for compounds of wordnet's lexical entries, as far as I know.
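
One possible workaround for compounds is to aggregate word-level scores: match each word in one phrase to its best-scoring word in the other, average those scores, and do the same in the opposite direction. This is a heuristic sketch and an assumption on my part, not something WordNet or NLTK provides. The names phrase_similarity and exact_match are hypothetical, and the toy exact_match scorer stands in for a real word-level measure such as the maximum wup_similarity over the two words' synsets.

```python
def phrase_similarity(phrase_a, phrase_b, word_similarity):
    """Symmetric best-match average of word-level similarity scores.

    word_similarity(w, v) must return a number for any two words.
    """
    words_a = phrase_a.lower().split()
    words_b = phrase_b.lower().split()
    if not words_a or not words_b:
        return 0.0

    def directed(src, dst):
        # For each source word, keep its best score against the other phrase.
        return sum(max(word_similarity(w, v) for v in dst) for w in src) / len(src)

    return (directed(words_a, words_b) + directed(words_b, words_a)) / 2.0


def exact_match(w, v):
    """Toy word-level scorer: 1.0 for identical words, otherwise 0.0."""
    return 1.0 if w == v else 0.0


if __name__ == '__main__':
    print(phrase_similarity('Social networking service',
                            'Social network',
                            exact_match))
```

With the toy scorer, 'networking' and 'network' count as unrelated, which is exactly the gap a WordNet-based or stemmed word scorer would need to fill.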

This is a nice web implementation of many WordNet-based similarity measures

Some further reading on interpreting compounds using wordnet similarity (although not evaluating similarity on compounds), if you're interested:

Upvotes: 3
