"Cloning" a corpus in NLTK?

Question

I'm trying to create my own corpus in NLTK. I've been reading the documentation and it seems rather complicated. All I want to do is "clone" the movie reviews corpus, but with my own text. I know I could just replace the files in the movie reviews corpus with my own, but that limits me to working with one such corpus at a time (i.e., I'd have to keep swapping files). Is there any way I could just clone the movie reviews corpus?

Thanks, Alex

alexis · Accepted Answer

The movie reviews are read with the CategorizedPlaintextCorpusReader class. Use it directly to load your corpus. The following should work for an exact copy of the movies corpus:

from nltk.corpus.reader import CategorizedPlaintextCorpusReader

mr = CategorizedPlaintextCorpusReader(path_to_your_reviews, r'(?!\.).*\.txt',
        cat_pattern=r'(neg|pos)/.*')

Whatever the parenthesized group in cat_pattern matches becomes the category: in this case, neg or pos. If your corpus has different categories (e.g., movie genres rather than positive/negative evaluations), change the directory structure and adjust the cat_pattern parameter to match.
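To make this concrete, here is a minimal sketch that builds a throwaway corpus in a temporary directory and loads it with the same reader call. The file names, review texts, and the build_toy_corpus helper are made up for illustration; only the pos/neg directory layout mirrors the movie reviews corpus.

```python
import os
import tempfile

from nltk.corpus.reader import CategorizedPlaintextCorpusReader


def build_toy_corpus(root):
    """Write two hypothetical reviews into pos/ and neg/ subdirectories."""
    samples = {
        "pos/review1.txt": "A delightful film with a clever script.",
        "neg/review2.txt": "A dull film with a lazy script.",
    }
    for relpath, text in samples.items():
        full = os.path.join(root, relpath)
        os.makedirs(os.path.dirname(full), exist_ok=True)
        with open(full, "w", encoding="utf-8") as f:
            f.write(text)


# Hypothetical corpus location; point this at your own directory instead.
corpus_root = tempfile.mkdtemp()
build_toy_corpus(corpus_root)

# Same arguments as in the answer: every .txt file that is not a dotfile,
# with the category taken from the leading directory name.
reader = CategorizedPlaintextCorpusReader(
    corpus_root, r'(?!\.).*\.txt', cat_pattern=r'(neg|pos)/.*')

print(reader.categories())
print(reader.fileids(categories='pos'))
print(reader.words(categories='neg')[:4])
```

Because each reader is just an object tied to its own root directory, you can create as many of them as you like, one per corpus, and use them side by side without swapping any files.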

PS. For categorized corpora with a different structure, NLTK offers other ways to specify the categories besides cat_pattern, such as the cat_map and cat_file parameters; read the documentation of CategorizedPlaintextCorpusReader.
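As one illustration of such an alternative, the sketch below uses the cat_map parameter, which takes a dictionary from file ids to lists of categories. It assumes a flat directory with no category subfolders; the file names and genre labels are invented for the example.

```python
import os
import tempfile

from nltk.corpus.reader import CategorizedPlaintextCorpusReader

# Hypothetical flat corpus: all files sit directly in one directory.
flat_root = tempfile.mkdtemp()
texts = {
    "alien.txt": "A tense and frightening voyage.",
    "airplane.txt": "A relentlessly silly flight.",
}
for name, text in texts.items():
    with open(os.path.join(flat_root, name), "w", encoding="utf-8") as f:
        f.write(text)

# Categories are assigned explicitly per file instead of by directory name.
genre_reader = CategorizedPlaintextCorpusReader(
    flat_root, r'.*\.txt',
    cat_map={"alien.txt": ["horror", "scifi"], "airplane.txt": ["comedy"]})

print(genre_reader.categories())
print(genre_reader.fileids(categories='scifi'))
```

A file may belong to several categories this way, which the directory-based pattern cannot express.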
