Reputation: 1512
I am experimenting with NaiveBayesClassifier and have the following training data:
positive_vocab = [ 'awesome' ]
negative_vocab = [ 'bad']
neutral_vocab = [ 'so-so' ]
...
classifier = NaiveBayesClassifier.train(train_set)
I then classify the following sentence: bad Awesome movie, I liked it
Here is what I get for each word:
bad:neg awesome:pos movie,:pos i:pos liked:pos it:pos
How and why is the decision made to classify words that are not in the training set (such as I, liked, it, movie) as positive?
thanks
Upvotes: 0
Views: 327
Reputation: 862
Training a sentiment model means that the model learns how words affect the sentiment. So it is not about specifying which words are positive and which are negative; it is about training the model to infer that from the text itself.
The simplest representation is called "bag of words" (often combined with TF-IDF normalization). It works this way: you split your text into words and count the occurrences of each word within a given text block (or review). Rows then correspond to reviews and columns to words, with each cell holding the number of occurrences of that word in that review. This table becomes your X, and the target sentiment to predict becomes your Y (say 0 for negative and 1 for positive).
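To make the table described above concrete, here is a minimal sketch using CountVectorizer on three made-up reviews. The toy_reviews list and the toy_ variable names are hypothetical and only for illustration; the tokenization shown is what CountVectorizer does with its default settings.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy reviews, made up for illustration only.
toy_reviews = ["awesome movie", "bad movie bad acting", "so-so plot"]

toy_vectorizer = CountVectorizer()
toy_X = toy_vectorizer.fit_transform(toy_reviews)  # sparse matrix: reviews x words

# Columns are ordered alphabetically by word.
vocab = sorted(toy_vectorizer.vocabulary_)
counts = toy_X.toarray()

print(vocab)   # column labels
print(counts)  # one row per review, one column per word
```

Note that with the default tokenizer the hyphenated "so-so" is split into two occurrences of "so", so a vocabulary entry like 'so-so' would need a custom token_pattern to survive as one word.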
Then you train your classifier:
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
reviews, Y = your_load_function() # placeholder: returns a list of texts and their labels
vectorizer = TfidfVectorizer() # or CountVectorizer()
X = vectorizer.fit_transform(reviews) # convert text to TF-IDF weights (raw word counts with CountVectorizer)
model = MultinomialNB()
model.fit(X, Y)
After the model is trained you can make predictions:
new_reviews = your_load_function2()
new_X = vectorizer.transform(new_reviews)
predicted_Y = model.predict(new_X)
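As for words that never appeared in training: in this scikit-learn setup they are simply dropped by vectorizer.transform, so a text made only of unseen words becomes an all-zero row and MultinomialNB falls back to the class priors learned from the label frequencies. The sketch below illustrates that with made-up data (demo_reviews, demo_Y and the demo_ names are hypothetical). Whether the NaiveBayesClassifier from the question behaves the same way is an assumption I have not verified here.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical training data: two positive reviews (1), one negative (0).
demo_reviews = ["awesome film", "great film", "bad film"]
demo_Y = [1, 1, 0]

demo_vectorizer = CountVectorizer()
demo_model = MultinomialNB()
demo_model.fit(demo_vectorizer.fit_transform(demo_reviews), demo_Y)

# Neither "liked" nor "it" is in the learned vocabulary,
# so the transformed row contains no non-zero entries.
unseen_X = demo_vectorizer.transform(["liked it"])
unseen_proba = demo_model.predict_proba(unseen_X)
unseen_pred = demo_model.predict(unseen_X)

print(unseen_X.nnz)                          # number of non-zero features
print(unseen_proba)                          # predicted class probabilities
print(np.exp(demo_model.class_log_prior_))   # class priors from label frequencies
```

In this sketch the prediction for the unseen-only text is just the majority class of the training labels, which is one plausible reason a word outside the training vocabulary ends up labelled positive.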
Further reading:
https://en.wikipedia.org/wiki/Bag-of-words_model
https://en.wikipedia.org/wiki/Tf-idf
http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html
http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
Upvotes: 1