NLTK document classification

Question

In Chapter 6 of the NLTK book, section 2.1, the code loads the movie reviews corpus for document classification. The code in the book is as follows:

import random
from nltk.corpus import movie_reviews

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)

I have my own dataset: a comma-separated file of (text, category) rows, where the text is an email and the category is either positive or negative. Can I call .words() on my own file? Also, what does the code mean when it calls movie_reviews.categories()? I am having trouble understanding how to structure the data to get it into the form needed by the code. I have looked at the individual corpus files, but I can't figure out what to do from here. Any help would be appreciated. Thanks!

arturomp · Accepted Answer

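words() just returns "the given file(s) as a list of words and punctuation symbols" according to the documentation. Note that it is a method of a corpus reader object, not a standalone function (nltk.corpus.words is itself a word-list corpus), so to call it on your own text files you first load them through a corpus reader such as PlaintextCorpusReader.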

As for categories(), further down in the documentation, it says that it "Return[s] a list of the categories that are defined for this corpus, or for the file(s) if it is given." However, the source for it is a bit more obscure. Notice that different corpora have different ways of indicating their categories. movie_reviews does it through directory names, but abc and reuters have explicit categories in a file. qc has the categories in the same file as the text.
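To make the directory-name convention concrete, here is a minimal sketch that builds a tiny throwaway corpus laid out like movie_reviews and reads it with CategorizedPlaintextCorpusReader. The pos/neg directory names, the file names, and the sample texts are made up for illustration; the cat_pattern argument is a regular expression whose first group is taken as the category of each file id.

```python
import os
import tempfile

from nltk.corpus.reader import CategorizedPlaintextCorpusReader

# Hypothetical sample data: one subdirectory per category,
# one text file per document, as in movie_reviews.
root = tempfile.mkdtemp()
samples = {
    "pos/email1.txt": "Thanks, this was great!",
    "neg/email2.txt": "This is not what I ordered.",
}
for relpath, text in samples.items():
    path = os.path.join(root, relpath)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

# Second argument: regex selecting the file ids to include.
# cat_pattern: regex whose first group is the category of a file id.
emails = CategorizedPlaintextCorpusReader(
    root, r"(?:pos|neg)/.*\.txt", cat_pattern=r"(pos|neg)/.*"
)

# The same comprehension as in the book, now over the custom corpus.
documents = [(list(emails.words(fileid)), category)
             for category in emails.categories()
             for fileid in emails.fileids(category)]
```

With a reader like this, categories() and fileids(category) behave as they do for movie_reviews, so the rest of the book's code should carry over unchanged.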

It might take a bit of experimenting with your own data to see if you can replicate this behaviour, but a reasonable first step would be to add a directory containing a subset of your data to nltk_data/corpora and to play around with the formats you see in other corpora.
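Since the data in the question is already a CSV of (text, category) rows, another route is to skip the corpus reader and build the documents list directly. The following is only a sketch under assumptions: the two sample rows and the tokenize helper are hypothetical, and the regular expression is a simple stand-in that splits words from punctuation, not necessarily what you would want for real emails.

```python
import csv
import io
import random
import re

# Stand-in for the real file. With data on disk, open it instead, e.g.
# open("emails.csv", newline="", encoding="utf-8") (hypothetical file name).
csv_file = io.StringIO(
    '"Thanks, this was great!",positive\n'
    '"This is not what I ordered.",negative\n'
)

def tokenize(text):
    # Hypothetical helper: words and runs of punctuation become
    # separate tokens, similar in spirit to what words() returns.
    return re.findall(r"\w+|[^\w\s]+", text)

# Same (list_of_words, category) shape that the book's code produces.
documents = [(tokenize(text), category)
             for text, category in csv.reader(csv_file)]
random.shuffle(documents)
```

The resulting list has the same structure as the one built from movie_reviews, so it can be passed to the feature extraction and classifier code that follows in the chapter.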
