kunalhdoshi

Reputation: 3

I am having problems doing Word Sense Disambiguation in Python using the Lesk algorithm

I am new to Python and NLTK, so please bear with me. I want to find the sense of a word in the context of a sentence. I am using the Lesk WSD algorithm, but it gives different outputs every time I run it. I know that Lesk is not fully accurate, but I think supplying a POS tag will improve its accuracy.

The Lesk algorithm takes a POS tag as an argument, but it expects WordNet-style tags such as 'n', 's', 'v' and not 'NN', 'VBP' or the other POS tags returned by the pos_tag() function. I would like to know how to tag words with 'n', 's', 'v' directly, or how to convert 'NN', 'VBP' and the other tags into 'n', 's', 'v', so that I can pass them to the lesk(context_sentence, word, pos_tag) function.

Afterwards, I calculate the sentiment score of every word using SentiWordNet.

    from nltk.wsd import lesk
    from nltk import word_tokenize
    import nltk, re, pprint
    from nltk.corpus import sentiwordnet as swn

    def word_sense():

        sent = word_tokenize("He should be happy.")
        word = "be"
        pos = "v"
        score = lesk(sent,word,pos)
        print(score)
        print (str(score),type(score))
        set1 = re.findall("'([^']*)'",str(score))[0]
        print (set1)
        bank = swn.senti_synset(str(set1))
        print (bank)

    word_sense()

Upvotes: 0

Views: 1969

Answers (1)

alvas

Reputation: 122082

nltk.wsd.lesk does not return a score; it returns the predicted Synset:

>>> import re
>>> from nltk.corpus import wordnet as wn
>>> from nltk.corpus import sentiwordnet as swn
>>> from nltk import word_tokenize
>>> from nltk.wsd import lesk
>>> sent = word_tokenize("He should be happy".lower())
>>> lesk(sent, 'be', 'v')
Synset('equal.v.01')

lesk is not perfect; it should only be used as a baseline system for WSD.

Although extracting the name with a regex works:

>>> ss = str(lesk(sent, 'be', 'v'))
>>> re.findall("'([^']*)'",ss)
['equal.v.01']

There's a simpler way to get the synset identifier:

>>> lesk(sent, 'be', 'v').name()
u'equal.v.01'

Then you can do:

>>> swn.senti_synset(lesk(sent, 'be', 'v').name())
SentiSynset('equal.v.01')

To convert a POS tag to a WordNet POS, see the related question "Converting POS tags from TextBlob into Wordnet compatible inputs".
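As a rough illustration of that conversion, here is a minimal sketch that maps Penn Treebank tags (the kind pos_tag() returns) to WordNet POS letters by prefix. The function name penn_to_wn and the hand-written tagged list are hypothetical and only for illustration; they are not part of NLTK. The letters 'n', 'v', 'a', 'r' are the values of WordNet's NOUN, VERB, ADJ and ADV constants. Penn Treebank tags do not distinguish adjective satellites ('s'), so this sketch maps all adjectives to 'a'.

```python
# Hypothetical helper (not part of NLTK): map Penn Treebank tag prefixes
# to the single-letter POS values used by WordNet.
_PENN_PREFIX_TO_WN = (
    ("NN", "n"),  # NN, NNS, NNP, NNPS -> noun
    ("VB", "v"),  # VB, VBD, VBG, VBN, VBP, VBZ -> verb
    ("JJ", "a"),  # JJ, JJR, JJS -> adjective
    ("RB", "r"),  # RB, RBR, RBS -> adverb
)


def penn_to_wn(tag):
    """Return the WordNet POS letter for a Penn Treebank tag, or None."""
    if not tag:
        return None
    tag = tag.upper()
    for prefix, wn_pos in _PENN_PREFIX_TO_WN:
        if tag.startswith(prefix):
            return wn_pos
    return None  # e.g. pronouns, modals, determiners have no WordNet POS


# Hand-written (word, tag) pairs standing in for real pos_tag() output.
tagged = [("He", "PRP"), ("should", "MD"), ("be", "VB"), ("happy", "JJ")]
converted = [(word, penn_to_wn(tag)) for word, tag in tagged]
print(converted)
```

The result can then be passed as the third argument, e.g. lesk(sent, word, penn_to_wn(tag)). When the helper returns None, lesk falls back to considering synsets of every part of speech, since pos defaults to None.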

Upvotes: 1
