Reputation: 189
I'm trying to build my own corpus for sentiment analysis using NLTK for Python. I have two text files: one contains 25K positive tweets, one per line, and the other contains 25K negative tweets.
I'm following method 2 from this Stack Overflow answer.
When I run this code to create the corpus:
import string
from itertools import chain
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from nltk.classify import NaiveBayesClassifier as nbc
from nltk.corpus import CategorizedPlaintextCorpusReader
import nltk
mydir = r'C:\Users\gerbuiker\Desktop\Sentiment Analyse\my_movie_reviews'  # raw string so backslashes are never read as escapes
mr = CategorizedPlaintextCorpusReader(mydir, r'(?!\.).*\.txt', cat_pattern=r'(neg|pos)/.*', encoding='ascii')
stop = stopwords.words('english')
documents = [([w for w in mr.words(i) if w.lower() not in stop and w.lower() not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]
word_features = FreqDist(chain(*[i for i,j in documents]))
word_features = word_features.keys()[:100]
numtrain = int(len(documents) * 90 / 100)
train_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[:numtrain]]
test_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[numtrain:]]
classifier = nbc.train(train_set)
print nltk.classify.accuracy(classifier, test_set)
classifier.show_most_informative_features(5)
I receive this error message:
C:\Users\gerbuiker\Anaconda\python.exe "C:/Users/gerbuiker/Desktop/Sentiment Analyse/CORPUS_POS_NEG/CreateCorpus.py"
Traceback (most recent call last):
  File "C:/Users/gerbuiker/Desktop/Sentiment Analyse/CORPUS_POS_NEG/CreateCorpus.py", line 23, in <module>
    documents = [([w for w in mr.words(i) if w.lower() not in stop and w.lower() not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]
  File "C:\Users\gerbuiker\AppData\Roaming\Python\Python27\site-packages\nltk\corpus\reader\util.py", line 336, in iterate_from
    assert self._len is not None
AssertionError
Process finished with exit code 1
Does anyone know how to fix this?
Upvotes: 1
Views: 1678
Reputation: 1047
I'm not 100% sure, since I'm not on a Windows machine to test this at the moment, but I think what may be tripping you up is the difference in path-separator direction between @alvas's original example and your adaptation for Windows.
Specifically, you use 'C:\Users\gerbuiker\Desktop\Sentiment Analyse\my_movie_reviews', while his example uses '/home/alvas/my_movie_reviews'. For the most part this is fine, but you re-use his cat_pattern regex, r'(neg|pos)/.*', which will match the forward slash in his paths but reject the backslash in yours.
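To illustrate the point about the regex, here is a minimal sketch using only Python's re module, not NLTK itself. It assumes the pattern is applied from the start of each file ID, the way re.match does. The file IDs ('pos/tweets.txt' and so on) and the second pattern are hypothetical, made up for this example. Whether NLTK actually produces backslash-separated file IDs on Windows is the untested guess above, so treat this as a way to check the pattern, not as a confirmed fix.

```python
import re

# cat_pattern taken from the original example: expects a forward slash.
FORWARD_SLASH_PATTERN = r'(neg|pos)/.*'
# Hypothetical variant (not from the original) that accepts either separator.
EITHER_SLASH_PATTERN = r'(neg|pos)[/\\].*'


def category_of(fileid, pattern):
    """Return the captured category, or None if the pattern does not match."""
    match = re.match(pattern, fileid)
    return match.group(1) if match else None


# Hypothetical file IDs, one per separator style.
unix_style = 'pos/tweets.txt'
windows_style = 'pos\\tweets.txt'

print(category_of(unix_style, FORWARD_SLASH_PATTERN))     # matches
print(category_of(windows_style, FORWARD_SLASH_PATTERN))  # no match
print(category_of(windows_style, EITHER_SLASH_PATTERN))   # matches
```

If printing mr.fileids() shows forward slashes on your machine, the pattern is not the problem and the cause lies elsewhere.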
Upvotes: 1