Reputation: 1395
In Chapter 6, section 2.1, of the NLTK book, the code loads the movie reviews corpus for document classification. The code in the book is as follows:
import random
from nltk.corpus import movie_reviews

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)
I have my own dataset: a comma-separated file of (text, category) pairs, where the text is an email and the category is either positive or negative. Can I call .words() on my own file? Also, what does the code mean when it calls movie_reviews.categories()? I am having trouble understanding how to structure the data to get it into the form the code needs. I have looked at the individual corpus files, but I can't figure out what to do from here. Any help would be appreciated. Thanks!
Upvotes: 1
Views: 1484
Reputation: 29580
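words() just returns "the given file(s) as a list of words and punctuation symbols" according to the documentation. It is a method of a corpus reader object rather than a standalone function (nltk.corpus.words is itself a corpus, the word-list corpus), so you can call it on any text file you have once that file is loaded through a reader such as PlaintextCorpusReader.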
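To make the point about words() concrete, here is a minimal sketch that writes one email to a temporary directory and reads it back through PlaintextCorpusReader. The helper name words_from_own_file and the file name email1.txt are made up for illustration; in practice you would point the reader at the directory that already holds your files.

```python
import tempfile
from pathlib import Path

from nltk.corpus.reader import PlaintextCorpusReader


def words_from_own_file(text):
    """Write text to a temporary file and read it back as word tokens.

    Hypothetical helper: the temporary directory stands in for a folder
    of your own plain-text files.
    """
    with tempfile.TemporaryDirectory() as root:
        Path(root, "email1.txt").write_text(text, encoding="utf-8")
        # The second argument is a regex selecting which files belong to the corpus.
        reader = PlaintextCorpusReader(root, r".*\.txt")
        # words() is lazy, so materialise it before the directory is removed.
        return list(reader.words("email1.txt"))


tokens = words_from_own_file("Thanks for the quick reply!")
print(tokens)
```

The default word tokenizer splits on word characters and punctuation, so no extra NLTK data download should be needed for words() alone.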
As for categories(), further down in the documentation it says that it "Return[s] a list of the categories that are defined for this corpus, or for the file(s) if it is given." However, the source for it is a bit more obscure. Notice that different corpora have different ways of indicating their categories: movie_reviews does it through directory names, but abc and reuters have explicit categories in a file, and qc has the categories in the same file as the text.
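The directory-name approach can be imitated for your own data. The sketch below assumes you are willing to write each email into a folder named after its category, and it uses CategorizedPlaintextCorpusReader with a cat_pattern regex to recover the category from the path. The function name build_documents, the file names, and the two sample emails are invented for the example.

```python
import tempfile
from pathlib import Path

from nltk.corpus.reader import CategorizedPlaintextCorpusReader


def build_documents(samples):
    """Lay out (text, category) pairs as <category>/<file>.txt and read them back.

    Hypothetical helper: a temporary directory stands in for your corpus root.
    """
    with tempfile.TemporaryDirectory() as root:
        for i, (text, category) in enumerate(samples):
            folder = Path(root, category)
            folder.mkdir(exist_ok=True)
            Path(folder, f"email{i}.txt").write_text(text, encoding="utf-8")

        reader = CategorizedPlaintextCorpusReader(
            root,
            r".*\.txt",                               # which files are in the corpus
            cat_pattern=r"(positive|negative)/.*",    # category = top-level folder name
        )
        categories = reader.categories()
        # Same shape as the book's list comprehension for movie_reviews.
        documents = [(list(reader.words(fileid)), category)
                     for category in categories
                     for fileid in reader.fileids(category)]
    return categories, documents


categories, documents = build_documents([
    ("Great product, thanks!", "positive"),
    ("Terrible service.", "negative"),
])
print(categories)
print(documents)
```

With this layout, categories() simply reports the folder names that matched cat_pattern, which is what it does for movie_reviews.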
It might take a bit of experimenting with your own data to see if you can replicate this behaviour, but a reasonable first step would be to add a directory containing a subset of your data to nltk_data/corpora and to play around with the formats you see in other corpora.
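If all you need is the list of (words, category) pairs, you may not need a corpus reader at all. This is a rough sketch under some assumptions: each CSV row is exactly text,category, the text is quoted when it contains commas, and a simple regex split is an acceptable stand-in for NLTK's tokenizer. The name load_documents and the sample rows are made up.

```python
import csv
import io
import random
import re


def load_documents(csv_file):
    """Turn rows of (text, category) into (list_of_words, category) pairs."""
    documents = []
    for row in csv.reader(csv_file):
        if len(row) != 2:
            continue  # skip blank or malformed rows
        text, category = row
        # Assumption: split into runs of word characters or punctuation.
        words = re.findall(r"\w+|[^\w\s]+", text)
        documents.append((words, category.strip()))
    return documents


# In-memory stand-in for your own file; open("emails.csv", newline="") would work the same way.
sample = io.StringIO(
    '"Great product, thanks!",positive\n'
    '"Terrible service.",negative\n'
)
documents = load_documents(sample)
random.shuffle(documents)
print(documents)
```

The result has the same structure as the documents list in the book, so it should drop into the rest of that section's code.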
Upvotes: 1