Rohit Pandey

Reputation: 2681

Python nltk Naive Bayes doesn't seem to work

I'm using the NLTK book, Natural Language Processing with Python (2009), and looking at the Naive Bayes classifier - in particular, Example 6-3 on p. 228 of my edition. The training set is movie reviews.

classifier = nltk.NaiveBayesClassifier.train(train_set)

I peek at the most informative features -

classifier.show_most_informative_features(5)

and I get 'outstanding', 'mulan' and 'wonderfully' among the top-ranking ones for a document to be tagged positive.

So, I try the following -

in1 = 'wonderfully mulan'
classifier.classify(document_features(in1.split()))

And I get 'neg'. Now this makes no sense. These were supposed to be the top features.

The document_features function is taken directly from the book -

def document_features(document):
    document_words = set(document)
    features = {}
    # word_features is the list of the 2000 most frequent words in the corpus,
    # built earlier in the book's example (see the sketch below)
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features
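
For reference, word_features and train_set come from the book's earlier setup. Roughly, it looks like this (paraphrased from Examples 6-2/6-3, not copied verbatim) -

import random
import nltk
from nltk.corpus import movie_reviews   # requires nltk.download('movie_reviews')

# (word list, label) pairs for every review in the corpus
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)

# word_features: the 2000 most frequent words across the whole corpus
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = [w for (w, _) in all_words.most_common(2000)]

# document_features is the function defined above
featuresets = [(document_features(d), c) for (d, c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]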

Upvotes: 1

Views: 937

Answers (2)

tpacker

Reputation: 11

There are at least two different flavors of the naive Bayes classifier. From a quick search, it appears that NLTK implements the Bernoulli flavor: Different results between the Bernoulli Naive Bayes in NLTK and in scikit-learn. In any case, some flavors of naive Bayes pay as much attention to the words/features missing from a document as to the ones that are present. So if you classify a document that contains a few positive words but is also missing many of the words that typically appear in positive documents, it is quite reasonable for that document to be categorized as negative. The bottom line: pay attention not only to the visible features but also to the missing ones (depending on the details of the naive Bayes implementation).
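
Here's a toy sketch of that effect using scikit-learn's BernoulliNB (a made-up five-word vocabulary, not the movie-review data) - a document can contain a word that individually favours pos and still come out neg because of everything it doesn't contain:

import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Toy documents over a 5-word vocabulary; 1 = word present, 0 = absent.
X = np.array([
    [1, 1, 1, 1, 0],   # pos
    [0, 1, 1, 1, 0],   # pos
    [1, 1, 1, 0, 0],   # pos
    [0, 0, 0, 0, 1],   # neg
    [0, 0, 0, 1, 1],   # neg
    [0, 0, 0, 0, 0],   # neg
])
y = np.array(['pos', 'pos', 'pos', 'neg', 'neg', 'neg'])

clf = BernoulliNB().fit(X, y)

# Word 0 on its own favours pos (it appears in 2/3 pos docs vs 0/3 neg docs),
# but a document containing only word 0 is missing words 1 and 2, which nearly
# every pos document has - so the absences win and the prediction is 'neg'.
print(clf.predict([[1, 0, 0, 0, 0]]))        # ['neg']
print(clf.predict_proba([[1, 0, 0, 0, 0]]))  # 'neg' gets the higher probability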

Upvotes: 0

arturomp

Reputation: 29580

Note that the feature vector in that example is made up of the "2000 most frequent words in the overall corpus." Assuming the corpus is comprehensive, a regular review will probably contain quite a few of those words. (In real-world reviews of the latest Jackass movie and Dallas Buyers Club, I get 26/2000 and 28/2000 features, respectively.)

If you feed it a review containing only "wonderfully mulan", the resulting feature vector only has 2/2000 features set to True. Basically, you're giving it a pseudoreview with little to no information that it knows about or that it can do anything with. For that vector, it's hard to tell what it will predict.
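
You can check this yourself - with word_features and document_features from the book in scope, count how many of the 2000 features actually get switched on (treat this as a sketch; the exact count depends on your word_features):

feats = document_features('wonderfully mulan'.split())
print(len(feats))           # 2000 features in total
print(sum(feats.values()))  # only the words actually present are True (here at most 2)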

The feature vector needs to be healthily populated with features leaning in a positive direction for the classifier to output pos. Maybe look at the most informative, say, 500 features, check which ones lean positive, and then build a string out of only those (a sketch of this follows below)? That might get you closer to pos, but not necessarily.
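
Here's a rough sketch of that idea. most_informative_features() is a real classifier method, but reading the classifier's internal _feature_probdist to see which way each feature leans is an implementation detail of NLTK's NaiveBayesClassifier, so treat this as a hack rather than a supported API:

import re

pos_words = []
for fname, fval in classifier.most_informative_features(500):
    if fval is not True:
        continue  # only look at features of the form contains(word) == True
    # keep the word if its presence is more likely under 'pos' than under 'neg'
    if (classifier._feature_probdist['pos', fname].prob(True) >
            classifier._feature_probdist['neg', fname].prob(True)):
        match = re.match(r'contains\((.*)\)', fname)
        if match:
            pos_words.append(match.group(1))

fake_review = ' '.join(pos_words)
print(classifier.classify(document_features(fake_review.split())))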

Some feature vectors in train_set are labeled pos and are well populated (anecdotally, I found one of them with 417 features set to True). However, in my tests, no documents from either the neg or pos training partitions classified as pos. So while you may be right that the classifier doesn't seem to be doing a great job - at the very least, the pos training examples should classify as pos - the example you're giving it is not a great measure of that.
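
If you want to run that check yourself (assuming train_set is the list of (features, label) pairs from the book), something like this:

from collections import Counter

results = Counter()
for feats, label in train_set:
    results[label, classifier.classify(feats)] += 1

# e.g. results['pos', 'pos'] is how many pos training examples actually come back as pos
print(results)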

Upvotes: 2
