mvh

Reputation: 189

Train corpus of Tweets for Sentiment Analysis, using NLTK for Python

I'm trying to train my own corpus for sentiment analysis, using NLTK for Python. I have two text files: one contains 25K positive tweets, one per line, and the other contains 25K negative tweets.

I'm following method 2 from this Stack Overflow post.

When I run this code to create corpora:

import string
from itertools import chain

from nltk.corpus import stopwords
from nltk.probability import FreqDist
from nltk.classify import NaiveBayesClassifier as nbc
from nltk.corpus import CategorizedPlaintextCorpusReader
import nltk

mydir = 'C:\Users\gerbuiker\Desktop\Sentiment Analyse\my_movie_reviews'

mr = CategorizedPlaintextCorpusReader(mydir, r'(?!\.).*\.txt', cat_pattern=r'(neg|pos)/.*', encoding='ascii')
stop = stopwords.words('english')
documents = [([w for w in mr.words(i) if w.lower() not in stop and w.lower() not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]

word_features = FreqDist(chain(*[i for i,j in documents]))
word_features = word_features.keys()[:100]

numtrain = int(len(documents) * 90 / 100)
train_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[:numtrain]]
test_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag  in documents[numtrain:]]

classifier = nbc.train(train_set)
print nltk.classify.accuracy(classifier, test_set)
classifier.show_most_informative_features(5)

I receive this error message:

C:\Users\gerbuiker\Anaconda\python.exe "C:/Users/gerbuiker/Desktop/Sentiment Analyse/CORPUS_POS_NEG/CreateCorpus.py"
Traceback (most recent call last):
  File "C:/Users/gerbuiker/Desktop/Sentiment Analyse/CORPUS_POS_NEG/CreateCorpus.py", line 23, in <module>
    documents = [([w for w in mr.words(i) if w.lower() not in stop and w.lower() not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]
  File "C:\Users\gerbuiker\AppData\Roaming\Python\Python27\site-packages\nltk\corpus\reader\util.py", line 336, in iterate_from
    assert self._len is not None
AssertionError

Process finished with exit code 1

Does anyone know how to fix this?

Upvotes: 1

Views: 1678

Answers (1)

abathur

Reputation: 1047

I'm not 100% sure, as I'm not on a Windows machine to test this at the moment, but I think what may be tripping you up is the difference in path slash direction between @alvas's original example and your adaptation to Windows.

Specifically, you use 'C:\Users\gerbuiker\Desktop\Sentiment Analyse\my_movie_reviews', while his example uses '/home/alvas/my_movie_reviews'. The root path itself is mostly fine, but you also reuse his cat_pattern regex, r'(neg|pos)/.*', which matches the forward slash in his paths but would reject a backslash in yours.
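To illustrate the regex point in isolation, here is a minimal sketch using only Python's re module. The file ids 'neg/tweets.txt' and 'neg\\tweets.txt' are made-up examples, the category helper is hypothetical, and I have not verified that NLTK actually hands backslash-separated ids to cat_pattern on Windows, so treat this as a demonstration of the pattern's behavior rather than a confirmed diagnosis.

```python
import re

# cat_pattern exactly as used in the question
original = re.compile(r'(neg|pos)/.*')
# Hypothetical variant that accepts either slash direction
tolerant = re.compile(r'(neg|pos)[/\\].*')

# Made-up file ids, one per separator style (assumption, not taken from NLTK)
posix_id = 'neg/tweets.txt'
windows_id = 'neg\\tweets.txt'

def category(pattern, fileid):
    """Return the captured category name, or None if the pattern does not match."""
    m = pattern.match(fileid)
    return m.group(1) if m else None

print(category(original, posix_id))    # neg
print(category(original, windows_id))  # None
print(category(tolerant, windows_id))  # neg
```

If the ids in your setup do turn out to contain backslashes, passing something like the tolerant pattern as cat_pattern might be worth a try; printing mr.fileids() and mr.categories() first would show what the reader actually sees.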

Upvotes: 1
