NLTK twitter_samples corpus by category

Question

I want to train NLTK with the twitter_samples corpus, but I get an error when I try to load the samples by category.

First I tried this:

from nltk.corpus import twitter_samples

documents = [(list(twitter_samples.strings(fileid)), category)
             for category in twitter_samples.categories()
             for fileid in twitter_samples.fileids(category)]

but it gave me this error:

Traceback (most recent call last):
  File "C:/Users/neptun/PycharmProjects/Thesis/First_sentimental.py", line 6, in <module>
    for category in twitter_samples.categories()
  File "C:\Users\neptun\AppData\Local\Programs\Python\Python36-32\lib\site-packages\nltk\corpus\util.py", line 119, in __getattr__
    return getattr(self, attr)
AttributeError: 'TwitterCorpusReader' object has no attribute 'categories'

I don't know which attributes are available for building my list of tweets labelled with positive and negative sentiment.

alexis · Accepted Answer

If you inspect twitter_samples.fileids(), you'll see that there are separate positive and negative files:

>>> twitter_samples.fileids()
['negative_tweets.json', 'positive_tweets.json', 'tweets.20150430-223406.json']

So to get the tweets classified as positive or negative, just select the corresponding file. It's not the usual way the nltk handles categorized corpora, but there you have it.

documents = ([(t, "pos") for t in twitter_samples.strings("positive_tweets.json")] + 
             [(t, "neg") for t in twitter_samples.strings("negative_tweets.json")])

This will get you a dataset of 10000 tweets. The third file contains another 20000, which apparently are not categorized.
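If you would rather keep the file-to-label mapping in one place, since this reader has no categories(), the same idea can be sketched as a small helper. This is only an illustration: FILE_LABELS, label_documents, label_counts and load_twitter_documents are names made up here, not part of the nltk. The only nltk call it assumes is twitter_samples.strings(fileid), the same one used above.

```python
from collections import Counter

# Hypothetical mapping from corpus file to sentiment label. The file names
# are the ones reported by twitter_samples.fileids().
FILE_LABELS = {
    "positive_tweets.json": "pos",
    "negative_tweets.json": "neg",
}


def label_documents(reader, file_labels=FILE_LABELS):
    """Return a list of (tweet_text, label) pairs.

    `reader` is any object with a strings(fileid) method,
    for example nltk.corpus.twitter_samples.
    """
    return [(text, label)
            for fileid, label in file_labels.items()
            for text in reader.strings(fileid)]


def label_counts(documents):
    """Count how many documents carry each label (a quick sanity check)."""
    return Counter(label for _, label in documents)


def load_twitter_documents():
    """Build the labelled list from the real corpus (needs nltk and its data)."""
    # Imported here so the helpers above work even without nltk installed.
    from nltk.corpus import twitter_samples
    return label_documents(twitter_samples)
```

Calling load_twitter_documents() should give the same pairs as the list above, and label_counts() on the result lets you check that both labels are present before you train anything.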
