thesisstudent

Reputation: 119

NLTK sentiment analysis: always returns one value

Sorry for posting this, as the answer is probably in either this question: NLTK sentiment analysis is only returning one value

or this one: Python NLTK not sentiment calculate correct

but I don't see how to apply it to my code.

I'm a huge newbie at Python and NLTK and I hate that I have to bother you with a huge block of code, so sorry once again.

With the code below, I always get 'pos' as a result. I've also tried leaving the positive features out of the training set; then the result is always 'neutral'.

Can anybody tell me what I'm doing wrong? Thank you so much in advance! And don't mind the random test sentence I used, it was just something that came up while I was trying to figure out what was wrong.

import re, math, collections, itertools
import nltk
import nltk.classify.util, nltk.metrics
from nltk.classify import NaiveBayesClassifier
from nltk.metrics import BigramAssocMeasures
from nltk.probability import FreqDist, ConditionalFreqDist  
from nltk.util import ngrams
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem.porter import *
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english", ignore_stopwords=True)

pos_tweets = ['I love bananas','I like pears','I eat oranges']
neg_tweets = ['I hate lettuce','I do not like tomatoes','I hate apples']
neutral_tweets = ['I buy chicken','I am boiling eggs','I am chopping vegetables']

def uni(doc):
    x = []
    y = []
    for tweet in doc:
        x.append(word_tokenize(tweet))
    for element in x:
        for word in element:
            if len(word)>2:
                word = word.lower()
                word = stemmer.stem(word)
                y.append(word)
    return y

def word_feats_uni(doc):
     return dict([(word, True) for word in uni(doc)])

def tokenizer_ngrams(document):
    all_tokens = []
    filtered_tokens = []
    for (sentence) in document:
        all_tokens.append(word_tokenize(sentence))
    return all_tokens

def get_bi (document):
    x = tokenizer_ngrams(document)
    c = []
    for sentence in x:
        c.extend([bigram for bigram in nltk.bigrams(sentence)])
    return c

def get_tri(document):
    x = tokenizer_ngrams(document)
    c = []
    for sentence in x:
        c.extend([bigram for bigram in nltk.bigrams(sentence)])
    return c

def word_feats_bi(doc): 
    return dict([(word, True) for word in get_bi(doc)])

def word_feats_tri(doc):
    return dict([(word, True) for word in get_tri(doc)])

def word_feats_test(doc):
    feats_test = {}
    feats_test.update(word_feats_uni(doc))
    feats_test.update(word_feats_bi(doc))
    feats_test.update(word_feats_tri(doc))
    return feats_test

pos_feats = [(word_feats_uni(pos_tweets),'pos')] + [(word_feats_bi(pos_tweets),'pos')] + [(word_feats_tri(pos_tweets),'pos')]

neg_feats = [(word_feats_uni(neg_tweets),'neg')] + [(word_feats_bi(neg_tweets),'neg')] + [(word_feats_tri(neg_tweets),'neg')]

neutral_feats = [(word_feats_uni(neutral_tweets),'neutral')] + [(word_feats_bi(neutral_tweets),'neutral')] + [(word_feats_tri(neutral_tweets),'neutral')]

trainfeats = pos_feats + neg_feats + neutral_feats

classifier = NaiveBayesClassifier.train(trainfeats)

print (classifier.classify(word_feats_test('I am chopping vegetables and boiling eggs')))

Upvotes: 2

Views: 514

Answers (1)

The solution is very simple. Your word_feats_test returns an empty dictionary for the sentence 'I am chopping vegetables and boiling eggs'; with no features to go on, the classifier falls back on its default and returns pos.
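To see why the featureset comes out empty (a small illustration, no NLTK needed): your uni() does `for tweet in doc:`, and iterating over a *string* yields single characters, not tweets:

```python
sentence = 'I am chopping vegetables and boiling eggs'

# Iterating a string gives one character at a time:
print(list(sentence)[:5])   # ['I', ' ', 'a', 'm', ' ']

# Each one-character "tweet" tokenizes to at most one one-character
# token, so the len(word) > 2 filter in uni() rejects everything
# and the feature dict ends up empty:
features = {w: True for w in sentence if len(w) > 2}
print(features)             # {}
```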

I wrapped your sentence in a list:

print(classifier.classify(word_feats_test(
      ['I am chopping vegetables and boiling eggs'])))

and neutral is printed.

You ought to use the exact same feature-extraction function for all three steps: training, testing, and classification.
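A minimal sketch of that advice: build one featureset *per tweet* with a single function, and reuse that same function at classification time. (`str.split()` stands in for `word_tokenize` here so the snippet needs no tokenizer data; the feature function name is illustrative.)

```python
from nltk.classify import NaiveBayesClassifier

def word_feats(tweet):
    # one feature dict per tweet, lowercased bag-of-words
    return {word.lower(): True for word in tweet.split()}

pos_tweets = ['I love bananas', 'I like pears', 'I eat oranges']
neg_tweets = ['I hate lettuce', 'I do not like tomatoes', 'I hate apples']
neutral_tweets = ['I buy chicken', 'I am boiling eggs', 'I am chopping vegetables']

# one (featureset, label) pair per tweet, not per corpus
trainfeats = ([(word_feats(t), 'pos') for t in pos_tweets]
            + [(word_feats(t), 'neg') for t in neg_tweets]
            + [(word_feats(t), 'neutral') for t in neutral_tweets])

classifier = NaiveBayesClassifier.train(trainfeats)

# classification uses the *same* feature function as training:
print(classifier.classify(word_feats('I am chopping vegetables and boiling eggs')))
# → neutral
```

With nine labelled featuresets instead of three corpus-level dictionaries, the classifier actually has per-document evidence to learn from.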

Upvotes: 1
