Reputation: 920
I'm trying to create my own corpus in NLTK. I've been reading the documentation and it seems rather complicated. All I want to do is "clone" the movie reviews corpus, but with my own text. I know I could just replace the files in the movie reviews corpus with my own, but that limits me to working with one such corpus at a time (i.e., I'd have to keep swapping files). Is there any way I could just clone the movie reviews corpus?
Thanks, Alex
Upvotes: 0
Views: 336
Reputation: 50220
The movie reviews are read with the CategorizedPlaintextCorpusReader class. Use it directly to load your corpus. The following should work for an exact copy of the movies corpus:
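from nltk.corpus.reader import CategorizedPlaintextCorpusReader

# path_to_your_reviews is the directory that contains the neg/ and pos/ subdirectories
mr = CategorizedPlaintextCorpusReader(path_to_your_reviews, r'(?!\.).*\.txt',
                                      cat_pattern=r'(neg|pos)/.*')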
Whatever matches inside the parentheses of cat_pattern becomes the category: in this case, neg and pos. If your corpus has different categories (e.g., movie genres rather than positive/negative evaluations), change the directory structure and adjust the cat_pattern parameter to match.
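To make the role of cat_pattern concrete, here is a rough sketch using only the standard re module. It assumes the category is the first capture group of the pattern matched against each file's path relative to the corpus root; category_of and the file names are made up for illustration and are not part of NLTK.

```python
import re

# Same pattern as in the reader call: the parenthesised group is the category.
cat_pattern = r'(neg|pos)/.*'

def category_of(file_id, pattern=cat_pattern):
    """Return the first capture group of pattern for file_id, or None if no match.

    Hypothetical helper that only imitates how a category is derived
    from a file path; it does not call NLTK.
    """
    match = re.match(pattern, file_id)
    return match.group(1) if match else None

# Hypothetical file ids, relative to the corpus root.
file_ids = ['neg/review_a.txt', 'pos/review_b.txt']
labels = {file_id: category_of(file_id) for file_id in file_ids}

# A genre-based layout would only need a different group in the pattern.
genre_label = category_of('comedy/review_c.txt', r'(comedy|drama)/.*')
```

So a file stored under pos/ ends up in the pos category, and switching to genres is a matter of renaming the directories and editing the group in the pattern.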
PS. For categorized corpora with a different structure, NLTK offers several ways to specify the categories; read the documentation of CategorizedPlaintextCorpusReader.
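For anyone who wants to try this end to end, the following is a minimal sketch under some assumptions: it writes a toy corpus into a temporary directory (the file names and review texts are invented) and loads it with the reader. Since each reader is just an object pointing at a directory, several such corpora can be open at once, so there is no need to swap files.

```python
import tempfile
from pathlib import Path

from nltk.corpus.reader import CategorizedPlaintextCorpusReader

def build_toy_corpus(root):
    """Write a tiny movie_reviews-style layout: root/neg/*.txt and root/pos/*.txt."""
    samples = {  # invented sample data, for illustration only
        'neg/r1.txt': 'A dull and lifeless film .',
        'pos/r2.txt': 'A warm and funny film .',
    }
    for rel_path, text in samples.items():
        target = root / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(text, encoding='utf-8')

tmp = tempfile.TemporaryDirectory()
corpus_root = Path(tmp.name)
build_toy_corpus(corpus_root)

# Same arguments as for the movie reviews, pointed at our own directory.
my_reviews = CategorizedPlaintextCorpusReader(
    str(corpus_root),
    r'(?!\.).*\.txt',
    cat_pattern=r'(neg|pos)/.*',
)

categories = my_reviews.categories()
pos_files = my_reviews.fileids(categories='pos')
pos_words = list(my_reviews.words(categories='pos'))
```

A second corpus is then just another CategorizedPlaintextCorpusReader created over a different root directory.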
Upvotes: 1
Reputation: 4182
Why don't you define a new corpus by copying the definition of movie_reviews in nltk.corpus? You can do this as often as you want with new directories: copy the directory structure and replace the files.
Upvotes: 0