nltk stemming and stop words for naive bayes

Question

I'm trying to understand why using stemming and removing stop words gives worse results in my naive Bayes classifier.

I have two files, one of positive reviews and one of negative reviews. Each has around 200 lines, and each line is long, possibly 5000 words.

I have the following code that creates a bag of words, builds two feature sets for training and testing, and then runs them through the NLTK classifier:

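import nltk

# all_words and featuresets are built earlier in the script (not shown)
word_features = list(all_words.keys())[:15000]

# hold out everything after the first 10000 feature sets for testing
testing_set = featuresets[10000:]
training_set = featuresets[:10000]

nbclassifier = nltk.NaiveBayesClassifier.train(training_set)
print((nltk.classify.accuracy(nbclassifier, testing_set))*100)

nbclassifier.show_most_informative_features(30)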

This produces around 45000 words and has an accuracy of 85%.
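
For context, the pieces not shown above (`all_words` and `featuresets`) are often built along the following lines. This is only a sketch under assumptions: the `documents` list, the `find_features` helper, and the sample reviews are hypothetical, and `collections.Counter` stands in for `nltk.FreqDist`. It yields one feature set per review, so the length of `featuresets` follows the number of reviews rather than the number of words.

```python
from collections import Counter

# Hypothetical stand-in data: (tokens, label) pairs, one per review line.
documents = [
    ("a great and moving film".split(), "pos"),
    ("great cast and a great script".split(), "pos"),
    ("a dull and boring film".split(), "neg"),
    ("boring script and a weak cast".split(), "neg"),
]

# Bag of words: count every token across all reviews.
all_words = Counter(w.lower() for tokens, _ in documents for w in tokens)

# Keep the most frequent words as features (15000 in the question).
word_features = [w for w, _ in all_words.most_common(15000)]

def find_features(tokens):
    """Map each feature word to True/False depending on presence in the review."""
    present = set(tokens)
    return {w: (w in present) for w in word_features}

# One (features, label) pair per review.
featuresets = [(find_features(tokens), label) for tokens, label in documents]

# Vocabulary size versus number of feature sets.
print(len(all_words), len(featuresets))
```

Printing both numbers before training makes it easy to see whether the slice indices used for the split are realistic for the data.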

I then tried adding stemming (PorterStemmer) and removing stop words from my training data. When I run the classifier again, I get only 205 words and 0% accuracy. When I test other classifiers, the script fails with this error:

Traceback (most recent call last):
  File "foo.py", line 108, in <module>
    print((nltk.classify.accuracy(MNB_classifier, testing_set))*100)
  File "/Library/Python/2.7/site-packages/nltk/classify/util.py", line 87, in accuracy
    results = classifier.classify_many([fs for (fs, l) in gold])
  File "/Library/Python/2.7/site-packages/nltk/classify/scikitlearn.py", line 83, in classify_many
    X = self._vectorizer.transform(featuresets)
  File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/dict_vectorizer.py", line 293, in transform
    return self._transform(X, fitting=False)
  File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/dict_vectorizer.py", line 184, in _transform
    raise ValueError("Sample sequence X is empty.")
ValueError: Sample sequence X is empty.
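
One thing worth checking, offered as a guess rather than a diagnosis: a Python slice that starts past the end of a list returns an empty list without raising, so `featuresets[10000:]` is empty whenever there are fewer than 10000 feature sets. An empty test set would be consistent with the `Sample sequence X is empty` message. The sketch below uses a hypothetical `split_featuresets` helper and made-up data; it splits by proportion and fails loudly if either side ends up empty.

```python
import random

def split_featuresets(featuresets, train_fraction=0.8, seed=0):
    """Shuffle and split into (training_set, testing_set) by proportion."""
    items = list(featuresets)
    random.Random(seed).shuffle(items)
    cut = int(len(items) * train_fraction)
    training_set, testing_set = items[:cut], items[cut:]
    if not training_set or not testing_set:
        raise ValueError(
            "empty split: %d training, %d testing feature sets"
            % (len(training_set), len(testing_set))
        )
    return training_set, testing_set

# Hypothetical data: 400 feature sets, roughly one per review line.
featuresets = [({"word_%d" % i: True}, "pos" if i % 2 else "neg") for i in range(400)]

# A fixed index beyond the list length silently yields an empty test set.
print(len(featuresets[:10000]), len(featuresets[10000:]))

# A proportional split adapts to however many feature sets exist.
training_set, testing_set = split_featuresets(featuresets)
print(len(training_set), len(testing_set))
```

Shuffling before the split also avoids a test set made up of a single class when the feature sets are stored with all positive reviews first.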

Why does adding stemming and/or removing stop words break the classifier?
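
A quick way to see what preprocessing does to the data is to count what survives it. The sketch below is an illustration under assumptions: `STOP_WORDS` is a tiny hand-written set rather than NLTK's stop word list, `preprocess` and `crude_stem` are hypothetical helpers, and `crude_stem` is a toy suffix stripper used only so the example runs without extra dependencies. A real stemmer's `stem` method could be passed as the `stem` argument instead.

```python
# Tiny illustrative stop list; a real run would use a full stop word list.
STOP_WORDS = {"a", "an", "and", "the", "is", "was", "of"}

def preprocess(tokens, stem=lambda w: w, stop_words=STOP_WORDS):
    """Lowercase, drop stop words, then stem the tokens that remain."""
    kept = [w.lower() for w in tokens if w.lower() not in stop_words]
    return [stem(w) for w in kept]

def crude_stem(word):
    """Toy suffix stripper for this demo only; not the Porter algorithm."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Hypothetical sample reviews, already tokenized.
reviews = [
    "The acting was moving and the ending worked".split(),
    "A boring film and a wasted cast".split(),
]

raw_vocab = {w.lower() for r in reviews for w in r}
clean_vocab = {w for r in reviews for w in preprocess(r, stem=crude_stem)}

# Compare vocabulary sizes before and after preprocessing. A very large
# drop may point to a problem in the preprocessing loop itself.
print(len(raw_vocab), len(clean_vocab))
```

Applying the same `preprocess` call to both the vocabulary and each review keeps the feature words and the review tokens in the same form, which matters because a stemmed feature word will never match an unstemmed token.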
