user2075215

Reputation: 379

nltk stemming and stop words for naive bayes

I'm trying to understand why adding stemming and removing stop words makes my naive bayes classifier produce worse results.

I have two files, positive and negative reviews, both around 200 lines long but with many words per line (possibly 5000 words per line).

I have the following code that creates a bag of words; I then build two feature sets for training and testing and run them against the NLTK classifier:

word_features = list(all_words.keys())[:15000]

testing_set = featuresets[10000:]
training_set = featuresets[:10000]

nbclassifier = nltk.NaiveBayesClassifier.train(training_set)
print((nltk.classify.accuracy(nbclassifier, testing_set))*100)

nbclassifier.show_most_informative_features(30)

This produces around 45000 words and has an accuracy of 85%.
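For context, the feature-extraction step that builds `featuresets` (omitted above) follows the standard NLTK bag-of-words pattern. A simplified sketch, with toy data standing in for my real reviews:

```python
# Toy stand-ins for the real data (hypothetical values):
word_features = ["great", "terrible", "plot"]
documents = [("great plot twist", "pos"), ("terrible plot", "neg")]

def find_features(review_text):
    words = set(review_text.split())
    # One boolean feature per candidate word: does this review contain it?
    return {w: (w in words) for w in word_features}

featuresets = [(find_features(text), label) for (text, label) in documents]
print(featuresets[0])
# ({'great': True, 'terrible': False, 'plot': True}, 'pos')
```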

I've looked at adding stemming (PorterStemmer) and removing stop words from my training data, but when I run the classifier again I now get only 205 words and 0% accuracy. While testing other classifiers, the script also raises this error:

Traceback (most recent call last):
  File "foo.py", line 108, in <module>
    print((nltk.classify.accuracy(MNB_classifier, testing_set))*100)
  File "/Library/Python/2.7/site-packages/nltk/classify/util.py", line 87, in accuracy
    results = classifier.classify_many([fs for (fs, l) in gold])
  File "/Library/Python/2.7/site-packages/nltk/classify/scikitlearn.py", line 83, in classify_many
    X = self._vectorizer.transform(featuresets)
  File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/dict_vectorizer.py", line 293, in transform
    return self._transform(X, fitting=False)
  File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/dict_vectorizer.py", line 184, in _transform
    raise ValueError("Sample sequence X is empty.")
ValueError: Sample sequence X is empty.

I don't understand why adding stemming and/or removing stop words breaks the classifier.

Upvotes: 0

Views: 1627

Answers (1)

Sunjay Dhama

Reputation: 49

Adding stemming or removing stop words shouldn't, on its own, cause your issue. I think you have a problem further up in your code, in how you read the files. I came across this same error while following sentdex's tutorial on YouTube and was stuck on it for an hour before I finally got it. If you follow his code you get this:

short_pos = open("short_reviews/positive.txt", "r").read()
short_neg = open("short_reviews/negative.txt", "r").read()

documents = []

for r in short_pos.split('\n'):
    documents.append( (r, 'pos' ))

for r in short_neg.split('\n'):
    documents.append( (r, 'neg' ))

all_words = []

short_pos_words = word_tokenize(short_pos)
short_neg_words = word_tokenize(short_neg)

for w in short_pos_words:
    all_words.append(w.lower())

for w in short_neg_words:
    all_words.append(w.lower())

all_words = nltk.FreqDist(all_words)
word_features = list(all_words.keys())[:5000]

I kept running into this error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x97 in position 6056: invalid start byte. You get this error because there are non-UTF-8 characters in the files provided. I was able to get around it by changing the code to this:

fname = 'short_reviews/positive.txt'
with open(fname, 'r', encoding='utf-16') as f:
    for line in f:
        pos_lines.append(line)

Unfortunately, then I started getting this error: UnicodeError: UTF-16 stream does not start with BOM

I forget how, but I made that error go away too. Then I started getting the same error as in your original question: ValueError: Sample sequence X is empty. When I printed the length of featuresets, I saw it was only 2:

print("Feature sets list length : ", len(featuresets))

After digging on this site, I found these two questions:

  1. Delete every non utf-8 symbols froms string
  2. 'str' object has no attribute 'decode' in Python3

The first one didn't really help, but the second one solved my problem (note: I'm using Python 3).

I'm not one for one-liners, but this worked for me:

pos_lines = [line.rstrip('\n') for line in open('short_reviews/positive.txt', 'r', encoding='ISO-8859-1')]
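If you'd rather avoid the one-liner, the same read can be written as an explicit loop with a context manager. Here I build a tiny stand-in file (hypothetical data, since I can't include the tutorial files) containing a byte that's invalid UTF-8 but valid ISO-8859-1:

```python
# Create a small stand-in for short_reviews/positive.txt (illustrative only);
# 0xE9 is 'é' in ISO-8859-1 but an invalid start byte in UTF-8.
with open('positive_sample.txt', 'wb') as f:
    f.write(b'good movie\n')
    f.write(b'caf\xe9 scene\n')

pos_lines = []
with open('positive_sample.txt', 'r', encoding='ISO-8859-1') as f:
    for line in f:
        pos_lines.append(line.rstrip('\n'))

print(pos_lines)  # ['good movie', 'café scene']
```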

I will update my GitHub repo later this week with the full code for the tutorial if you'd like to see the complete solution. I realize this answer probably comes 2 years too late, but hopefully it helps.

Upvotes: 1
