Jennifer

Reputation: 295

Create Corpus using PlainTextCorpusReader and Analyzing It

I am relatively new to Python and I am interested in understanding how to create a corpus using NLTK's PlaintextCorpusReader. I got as far as importing all the documents. However, when I run code to tokenize text across the corpus, it returns an error. I apologize if this question is a duplicate, but I would like some insight on this.

Here is the code for importing the documents. I have a bunch of text files on my computer related to the 2016 DNC (for reproducibility, take some or all of the text files from https://github.com/lin-jennifer/2016NCtranscripts).

import os
import nltk
from nltk.corpus import PlaintextCorpusReader
from nltk.corpus import stopwords

corpus_root = '/Users/JenniferLin/Desktop/Data/DNCtexts'
DNClist = PlaintextCorpusReader(corpus_root, '.*')

DNClist.fileids()

#Print the words of one of the texts to make sure everything is loaded
DNClist.words('dnc.giffords.txt')

type(DNClist)

str(DNClist)

When I go to tokenize the text, here is the code and output

Code:

from nltk.tokenize import sent_tokenize, word_tokenize

DNCtokens = sent_tokenize(DNClist)

Output: TypeError: expected string or bytes-like object

Even if I do something like DNClist.paras(), I get an error that reads UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9b in position 7: invalid start byte

I am wondering if there is an error in how I load the documents or in the process of tokenizing.

Thank you so much!

Upvotes: 2

Views: 4178

Answers (1)

Mike DeLong

Reputation: 338

It looks like what you want to do is tokenize the plain text documents in the folder. If so, ask the PlaintextCorpusReader for the tokens directly, rather than passing the PlaintextCorpusReader to the sentence tokenizer. So instead of

DNCtokens = sent_tokenize(DNClist)

please consider

DNCtokens = DNClist.sents() to get the sentences or DNCtokens = DNClist.paras() to get the paragraphs.
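For example, here is a minimal sketch of the reader-based approach (assuming the files in corpus_root decode cleanly; dnc.giffords.txt is the file id from your question):

from nltk.corpus import PlaintextCorpusReader
from nltk.tokenize import sent_tokenize

corpus_root = '/Users/JenniferLin/Desktop/Data/DNCtexts'
DNClist = PlaintextCorpusReader(corpus_root, '.*')

DNCsents = DNClist.sents()   # list of sentences, each a list of word tokens
DNCwords = DNClist.words()   # flat list of word tokens across the corpus
DNCparas = DNClist.paras()   # list of paragraphs, each a list of sentences

print(len(DNCsents), 'sentences in the corpus')
print(DNCsents[0])           # first sentence, already tokenized

# sent_tokenize expects a string, so pass it raw text if you want to use it:
giffords_sents = sent_tokenize(DNClist.raw('dnc.giffords.txt'))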

The source code for the reader shows that it holds a word tokenizer and a sentence tokenizer, and calls them to do the tokenization it looks like you want.
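As for the UnicodeDecodeError on DNClist.paras(): the byte 0x9b suggests at least one of the files is not UTF-8. The reader's constructor accepts an encoding argument (and, if you want, your own tokenizers), so a sketch like the one below may help; 'latin-1' here is only a guess, and you may need to check the actual encoding of the offending file:

from nltk.corpus import PlaintextCorpusReader
from nltk.tokenize import WordPunctTokenizer

corpus_root = '/Users/JenniferLin/Desktop/Data/DNCtexts'
DNClist = PlaintextCorpusReader(
    corpus_root,
    '.*',
    word_tokenizer=WordPunctTokenizer(),  # the reader's default word tokenizer
    encoding='latin-1')                   # override the default 'utf8'

DNCparas = DNClist.paras()  # should no longer raise UnicodeDecodeError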

Upvotes: 2
