leon

Reputation: 1512

NLTK NaiveBayesClassifier classifier issues

I am experimenting with NaiveBayesClassifier and have the following training data:

positive_vocab = [ 'awesome' ]
negative_vocab = [ 'bad']
neutral_vocab = [ 'so-so' ]
...
classifier = NaiveBayesClassifier.train(train_set) 

I then classify the following sentence: bad Awesome movie, I liked it

Here is what I get for each word:

bad:neg awesome:pos movie,:pos i:pos liked:pos it:pos

How and why are words that are not in the training set (such as "I", "liked", "it", "movie") classified as positive?

thanks

Upvotes: 0

Views: 327

Answers (1)

Dmitry Mottl

Reputation: 862

Training a sentiment model means that the model learns how words affect sentiment. So the point is not to specify which words are positive and which are negative; it is to train the model so that it infers this from the text itself.

The simplest approach is called "bag of words" (often combined with TF-IDF normalization). It works like this: you split each text into words and count the occurrences of each word within the given text block (or review). This gives a table in which rows correspond to reviews and columns correspond to words, with each cell holding the number of occurrences of that word in that review. This table becomes your X, and the target sentiment to predict becomes your Y (say, 0 for negative and 1 for positive).
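As a rough illustration of the table described above, here is a minimal pure-Python sketch. The function name `bag_of_words`, the whitespace tokenization, and the two sample reviews are assumptions made for this example; scikit-learn's vectorizers tokenize differently and return a sparse matrix.

```python
from collections import Counter


def bag_of_words(reviews):
    """Build a word-count table: one row per review, one column per word."""
    # Naive tokenization (an assumption for this sketch): lowercase + split on whitespace
    tokenized = [review.lower().split() for review in reviews]
    # The vocabulary is every distinct word seen in any review, in sorted order
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Counter returns 0 for words that do not occur in this review
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows


# Illustrative data, not taken from the question
reviews = ["awesome awesome movie", "bad movie"]
vocabulary, X = bag_of_words(reviews)

print(vocabulary)  # ['awesome', 'bad', 'movie']
print(X)           # [[2, 0, 1], [0, 1, 1]]
```

Each column position corresponds to one vocabulary word, so both reviews share the "movie" column while "awesome" and "bad" each appear in only one row.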

Then you train your classifier:

from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

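# your_load_function() is a placeholder: it should return a list of review
# strings and a list of labels (e.g. 0 for negative, 1 for positive)
reviews, Y = your_load_function()

vectorizer = TfidfVectorizer()  # or CountVectorizer() for raw word counts
X = vectorizer.fit_transform(reviews)  # learn the vocabulary and convert text to a matrix of word weights

model = MultinomialNB()
model.fit(X, Y)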

After the model is trained you can make predictions:

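# your_load_function2() is a placeholder that returns a list of new review strings
new_reviews = your_load_function2()
new_X = vectorizer.transform(new_reviews)  # reuse the fitted vocabulary (transform, not fit_transform)
predicted_Y = model.predict(new_X)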
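To try the whole pipeline without writing the loading functions, here is a small self-contained sketch. The four training reviews and their labels are made up for illustration (they are not the asker's data), and CountVectorizer is used instead of TfidfVectorizer so that the features are plain word counts.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data (assumed for this example): 1 = positive, 0 = negative
reviews = [
    "awesome movie, I liked it",
    "great film, really awesome",
    "bad movie, I hated it",
    "terrible film, really bad",
]
Y = [1, 1, 0, 0]

# Learn the vocabulary and build the word-count matrix
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(reviews)

# Fit multinomial Naive Bayes on the counts
model = MultinomialNB()
model.fit(X, Y)

# Classify new text using the vocabulary learned during training
new_reviews = ["awesome film", "bad film"]
new_X = vectorizer.transform(new_reviews)
predicted = model.predict(new_X).tolist()

print(predicted)  # [1, 0]
```

Note that `transform` ignores words that were not seen during training, so a text made up only of unseen words becomes an all-zero row and the prediction then rests on the class priors alone rather than on anything in the text.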

Further reading:
https://en.wikipedia.org/wiki/Bag-of-words_model
https://en.wikipedia.org/wiki/Tf-idf
http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html
http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

Upvotes: 1
