user2269090

Reputation: 11

Word sense disambiguation with SentiWordNet in Python

I'm currently doing research on sentiment analysis in Twitter. I want to combine a predefined lexicon resource like the SentiWordNet polarity scores with machine learning. The problem is getting the correct score from SentiWordNet: previous work usually just takes the total of the negative and positive polarity scores across a word's senses. For example, the word "mad" can appear 3 times as a negative sense and 2 times as a positive one, and most previous work simply averages each polarity. I want to disambiguate the word before taking its score, so that SentiWordNet is really used as it should be. I was thinking of comparing the similarity of the target sentence and the gloss sentence. Is there a method to compare them? Do you think it will work? If not, please share your ideas.

I'm completely new to this field and a novice Python programmer, so I really need your advice. Thank you.
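To make the problem concrete, this is roughly how I pull the raw scores with NLTK's SentiWordNet reader (assuming the sentiwordnet and wordnet corpora are downloaded); "mad" comes back with several senses carrying different polarities:

    from nltk.corpus import sentiwordnet as swn

    # Print every sense of "mad" with its positive/negative scores.
    for s in swn.senti_synsets('mad'):
        print(s.synset.name(), s.pos_score(), s.neg_score())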

Upvotes: 1

Views: 2115

Answers (1)

vpekar

Reputation: 3355

This is a word sense disambiguation problem, and getting your system to work reasonably well on any given multisense word will be very tough. You can try (a combination of) several methods to determine the right sense of a word:

  1. POS tagging will reduce the number of candidate senses (first sketch after this list).

  2. Compute the cosine similarity between the sentence and the gloss of each sense of the word in WordNet (second sketch below).

  3. Use SenseRelate: it measures the "WordNet similarity" between the different senses of the target word and its surrounding words (third sketch below).

  4. Use WordNet Domains: the database contains domain labels assigned to each WordNet sense, such as "Music" for the music sense of "rock". Instead of comparing the actual words found in the gloss and the sentence, you can compare the domain labels found in them (fourth sketch below).

  5. Represent the gloss and the sentence not by the words themselves, but as an average co-occurrence vector of those words. Such vectors can be built from a large text corpus, preferably one from the same application domain as the texts you are disambiguating. There are various techniques for refining such co-occurrence vectors (tf-idf, PCA, SVD), which you should read up on separately (last sketch below).
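A minimal sketch of method 1 with NLTK (assuming the punkt tokenizer and the averaged_perceptron_tagger model are downloaded): map the Penn Treebank tag to a WordNet POS, so that only senses with the observed part of speech remain as candidates.

    import nltk
    from nltk.corpus import wordnet as wn

    def penn_to_wn(tag):
        # Map Penn Treebank tags to WordNet POS constants.
        if tag.startswith('J'):
            return wn.ADJ
        if tag.startswith('V'):
            return wn.VERB
        if tag.startswith('N'):
            return wn.NOUN
        if tag.startswith('R'):
            return wn.ADV
        return None

    sentence = "He was mad at the referee"
    for word, tag in nltk.pos_tag(nltk.word_tokenize(sentence)):
        wn_pos = penn_to_wn(tag)
        if wn_pos:
            # Only senses matching the observed part of speech survive.
            print(word, wn.synsets(word, pos=wn_pos))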
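For method 2, a bag-of-words cosine between the sentence and each gloss is easy to write by hand. A sketch, assuming the stopwords corpus is available; the bow and best_sense helpers are mine, not a library API:

    import math
    from collections import Counter
    from nltk import word_tokenize
    from nltk.corpus import stopwords, wordnet as wn

    STOP = set(stopwords.words('english'))

    def bow(text):
        # Bag of words: lowercased alphabetic tokens minus stopwords.
        return Counter(w.lower() for w in word_tokenize(text)
                       if w.isalpha() and w.lower() not in STOP)

    def cosine(a, b):
        num = sum(a[w] * b[w] for w in set(a) & set(b))
        den = math.sqrt(sum(v * v for v in a.values())) \
            * math.sqrt(sum(v * v for v in b.values()))
        return num / den if den else 0.0

    def best_sense(sentence, word):
        # Pick the synset whose gloss looks most like the sentence.
        ctx = bow(sentence)
        return max(wn.synsets(word),
                   key=lambda s: cosine(ctx, bow(s.definition())),
                   default=None)

    print(best_sense("He was mad about losing the game", "mad"))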
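SenseRelate itself is a Perl package, but its core idea can be approximated in NLTK: score each candidate sense of the target by its total WordNet similarity to the closest sense of each surrounding word. A rough sketch using path similarity:

    from nltk.corpus import wordnet as wn

    def senserelate(target, context_words):
        # Score each sense of `target` by its summed path similarity
        # to the most similar sense of each context word.
        best, best_score = None, -1.0
        for sense in wn.synsets(target):
            score = 0.0
            for w in context_words:
                # path_similarity returns None across POS; treat as 0.
                sims = [sense.path_similarity(s) or 0.0
                        for s in wn.synsets(w)]
                if sims:
                    score += max(sims)
            if score > best_score:
                best, best_score = sense, score
        return best

    print(senserelate("mad", ["angry", "referee", "shouted"]))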
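WordNet Domains is a separate download, so any code here is only a sketch: the file path and line format below are assumptions, and the resource is keyed to an older WordNet version, so a real script also needs an offset-mapping step that is omitted here.

    from nltk.corpus import wordnet as wn

    # 'wn-domains.txt' is a placeholder path; the assumed line format is
    # '<8-digit offset>-<pos>\t<space-separated domain labels>'.
    def load_domains(path):
        domains = {}
        with open(path) as f:
            for line in f:
                key, labels = line.strip().split('\t')
                domains[key] = labels.split()
        return domains

    domains = load_domains('wn-domains.txt')
    for s in wn.synsets('rock'):
        key = '%08d-%s' % (s.offset(), s.pos())
        print(s.name(), domains.get(key, ['unknown']))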
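Finally, a toy version of method 5: build co-occurrence vectors from a corpus (Brown here, purely for illustration; a real system would use a large in-domain corpus and refine the counts with tf-idf or SVD), sum them over the sentence and over each gloss, and compare by cosine. Summing instead of averaging is fine here because cosine ignores scale.

    import math
    from collections import defaultdict, Counter
    from nltk import word_tokenize
    from nltk.corpus import brown, wordnet as wn

    WINDOW = 2

    # Simple co-occurrence counts from part of the Brown corpus.
    cooc = defaultdict(Counter)
    for sent in brown.sents()[:5000]:
        words = [w.lower() for w in sent if w.isalpha()]
        for i, w in enumerate(words):
            for c in words[max(0, i - WINDOW):i] + words[i + 1:i + 1 + WINDOW]:
                cooc[w][c] += 1

    def text_vector(text):
        # Sum the co-occurrence vectors of the words in `text`.
        vec = Counter()
        for w in word_tokenize(text):
            if w.isalpha():
                vec.update(cooc[w.lower()])
        return vec

    def cosine(a, b):
        num = sum(a[k] * b[k] for k in set(a) & set(b))
        den = math.sqrt(sum(v * v for v in a.values())) \
            * math.sqrt(sum(v * v for v in b.values()))
        return num / den if den else 0.0

    sentence = "He was mad about losing the game"
    for s in wn.synsets('mad'):
        print(s.name(), cosine(text_vector(sentence), text_vector(s.definition())))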

If your texts come from a very specialist domain (e.g., law), the accuracy will be higher. But if you work with general-language texts, you can expect good accuracy only on words that are not highly polysemous (i.e., those with no more than 3-4 senses in WordNet).

Upvotes: 4
