Reputation: 379
I'm trying to understand why stemming and stop-word removal make my Naive Bayes classifier perform worse.
I have two files, one of positive and one of negative reviews. Each has around 200 lines, but the lines are long, possibly 5000 words each.
The following code creates a bag of words; I then build two feature sets, for training and testing, and run them through the NLTK classifier:
word_features = list(all_words.keys())[:15000]
testing_set = featuresets[10000:]
training_set = featuresets[:10000]
nbclassifier = nltk.NaiveBayesClassifier.train(training_set)
print((nltk.classify.accuracy(nbclassifier, testing_set))*100)
nbclassifier.show_most_informative_features(30)
This produces around 45000 words and has an accuracy of 85%.
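For readers who want to reproduce the setup, here is a minimal standard-library sketch of the bag-of-words step described above. The toy `documents` list and the `find_features` helper are assumptions for illustration, not the code from the question. It uses `Counter.most_common()` because slicing `keys()` is not guaranteed to return the most frequent words.

```python
from collections import Counter

# Toy stand-ins for the review files (hypothetical data, not the real corpus).
documents = [
    ("a great and moving film", "pos"),
    ("great acting and a great script", "pos"),
    ("a dull and boring film", "neg"),
    ("boring plot and dull acting", "neg"),
]

# Count every lower-cased token across all documents.
all_words = Counter(
    word for text, _ in documents for word in text.lower().split()
)

# Keep the N most frequent words as the vocabulary. most_common() makes the
# frequency ordering explicit instead of relying on the order of keys().
word_features = [word for word, _ in all_words.most_common(5)]


def find_features(text):
    """Map a document to {word: True/False} over the fixed vocabulary."""
    tokens = set(text.lower().split())
    return {word: (word in tokens) for word in word_features}


featuresets = [(find_features(text), label) for text, label in documents]
print(len(word_features), len(featuresets))
```

Each entry of `featuresets` is a `(dict, label)` pair, which is the shape `nltk.NaiveBayesClassifier.train` expects. Any stemming or stop-word filtering would need to be applied identically when building `word_features` and inside `find_features`, otherwise the two vocabularies no longer match.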
I've tried adding stemming (PorterStemmer) and removing stop words from my training data, but when I run the classifier again I get only 205 words and 0% accuracy. When I test other classifiers, the script fails with this error:
Traceback (most recent call last):
  File "foo.py", line 108, in <module>
    print((nltk.classify.accuracy(MNB_classifier, testing_set))*100)
  File "/Library/Python/2.7/site-packages/nltk/classify/util.py", line 87, in accuracy
    results = classifier.classify_many([fs for (fs, l) in gold])
  File "/Library/Python/2.7/site-packages/nltk/classify/scikitlearn.py", line 83, in classify_many
    X = self._vectorizer.transform(featuresets)
  File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/dict_vectorizer.py", line 293, in transform
    return self._transform(X, fitting=False)
  File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/dict_vectorizer.py", line 184, in _transform
    raise ValueError("Sample sequence X is empty.")
ValueError: Sample sequence X is empty.
I don't understand why adding stemming and/or removing stop words breaks the classifier.
Upvotes: 0
Views: 1627
Reputation: 49
Adding stemming or removing stop words should not cause this error on its own. I think the problem is further up in your code, in how you read the file. I came across the same error while following sentdex's tutorial on YouTube. I was stuck for an hour, but I finally got it. If you follow his code you get this:
short_pos = open("short_reviews/positive.txt", "r").read()
short_neg = open("short_reviews/negative.txt", "r").read()
documents = []
for r in short_pos.split('\n'):
    documents.append( (r, 'pos' ))
for r in short_neg.split('\n'):
    documents.append( (r, 'neg' ))
all_words = []
short_pos_words = word_tokenize(short_pos)
short_neg_words = word_tokenize(short_neg)
for w in short_pos_words:
    all_words.append(w.lower())
for w in short_neg_words:
    all_words.append(w.lower())
all_words = nltk.FreqDist(all_words)
word_features = list(all_words.keys())[:5000]
I kept running into this error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x97 in position 6056: invalid start byte
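The byte named in that message can be examined on its own. This is a minimal sketch, assuming (as a guess, not something the tutorial states) that the review files were saved in Windows-1252. The sample bytes are made up for illustration.

```python
raw = b"a film \x97 and a good one"  # hypothetical line containing byte 0x97

# Strict UTF-8 rejects 0x97 as a start byte, which matches the error above.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Windows-1252 maps 0x97 to an em dash (U+2014).
as_cp1252 = raw.decode("cp1252")

# ISO-8859-1 maps every byte to a code point, so it never fails, but 0x97
# becomes the control character U+0097 rather than a dash.
as_latin1 = raw.decode("iso-8859-1")

print(utf8_ok, repr(as_cp1252), repr(as_latin1))
```

Both single-byte encodings get past the error. Which one is right depends on how the files were actually written, so it is worth looking at a decoded line before settling on one.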
You get this error because the provided files contain bytes that are not valid UTF-8. I was able to get around the error by changing the code to this:
fname = 'short_reviews/positive.txt'
pos_lines = []
with open(fname, 'r', encoding='utf-16') as f:
    for line in f:
        pos_lines.append(line)
Unfortunately, then I started getting this error:
UnicodeError: UTF-16 stream does not start with BOM
I forget how, but I made this error go away too. Then I started getting the same error as your original question:
ValueError: Sample sequence X is empty.
When I printed the length of featuresets, I saw it was only 2:
print("Feature sets list length : ", len(featuresets))
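A list that short explains the `ValueError`: slicing past the end of a list does not raise, it returns an empty list, so `featuresets[10000:]` hands the classifier nothing to test on. The sketch below shows this and one possible guard. `split_featuresets` and its `train_fraction` parameter are hypothetical names, not part of NLTK. Whether the same thing happened in the original question is an assumption, although roughly 400 review lines would also fall short of index 10000.

```python
featuresets = [("features_a", "pos"), ("features_b", "neg")]  # stand-in, length 2

# Slicing past the end of a list does not raise; it silently returns [].
training_set = featuresets[:10000]
testing_set = featuresets[10000:]
print(len(training_set), len(testing_set))


def split_featuresets(featuresets, train_fraction=0.8):
    """Hypothetical helper: split by proportion and refuse an empty side."""
    cut = int(len(featuresets) * train_fraction)
    train, test = featuresets[:cut], featuresets[cut:]
    if not train or not test:
        raise ValueError(
            "Only %d feature sets; cannot build both a training and a testing set"
            % len(featuresets)
        )
    return train, test


train, test = split_featuresets([("f%d" % i, "pos") for i in range(10)])
print(len(train), len(test))
```

Splitting by proportion keeps the test set non-empty as the corpus size changes, and the explicit check fails early with a message that points at the data rather than at the vectorizer.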
After digging on this site, I found these two questions:
The first one didn't really help, but the second one solved my problem (note: I'm using Python 3).
I'm not one for one-liners, but this worked for me:
pos_lines = [line.rstrip('\n') for line in open('short_reviews/positive.txt', 'r', encoding='ISO-8859-1')]
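The same idea can be written with a context manager so the file is closed promptly. This is a sketch, not the tutorial's code: `read_review_lines` is a hypothetical helper, and skipping blank lines is my own choice so that a trailing newline does not become an empty "review". The demo writes a temporary file so the snippet runs on its own.

```python
import os
import tempfile


def read_review_lines(path, encoding="ISO-8859-1"):
    """Return the non-empty lines of a review file, one review per line."""
    with open(path, "r", encoding=encoding) as f:
        return [line.rstrip("\n") for line in f if line.strip()]


# Demo on a temporary file that contains byte 0x97, like the tutorial data.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"a fine film \x97 really\nwell acted\n\n")
    demo_path = tmp.name

pos_lines = read_review_lines(demo_path)
os.remove(demo_path)

documents = [(line, "pos") for line in pos_lines]
print(len(documents))
```

With the real data you would call `read_review_lines('short_reviews/positive.txt')` and check `len(documents)` before building feature sets, since that count is what the later slicing depends on.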
I will update my GitHub repo later this week with the full code for the NLP tutorial if you'd like to see the complete solution. I realize this answer probably comes 2 years too late, but hopefully it helps.
Upvotes: 1