Reputation: 181
I cannot get the paras and sents functions in the PlaintextCorpusReader to work. Here is the code I have:
import nltk
from nltk.corpus import PlaintextCorpusReader
corpus_root = './dir_root'
newcorpus = PlaintextCorpusReader(corpus_root, '.*') # Files you want to add
word_list = newcorpus.words('file1.txt')
sentence_list = newcorpus.sents('file1.txt')
paragraph_list = newcorpus.paras('file1.txt')
print(word_list)
print(sentence_list)
print(paragraph_list)
word_list comes out fine.
['__________________________________________________________________', 'Title', ...]
But, paragraph_list and sentence_list both give this error:
Traceback (most recent call last):
File "corpus.py", line 13, in <module>
print(sentence_list)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/collections.py", line 225, in __repr__
for elt in self:
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/corpus/reader/util.py", line 296, in iterate_from
tokens = self.read_block(self._stream)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/corpus/reader/plaintext.py", line 129, in _read_sent_block
for sent in self._sent_tokenizer.tokenize(para)])
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/data.py", line 956, in __getattr__
self.__load()
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/data.py", line 948, in __load
resource = load(self._path)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/data.py", line 808, in load
opened_resource = _open(resource_url)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/data.py", line 926, in _open
return find(path_, path + ['']).open()
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/data.py", line 648, in find
raise LookupError(resource_not_found)
LookupError:
**********************************************************************
Resource 'tokenizers/punkt/PY3/english.pickle' not found.
Please use the NLTK Downloader to obtain the resource: >>>
nltk.download()
Searched in:
- '/Users/username/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
- ''
**********************************************************************
I tried using nltk.download() to download the file into the corpus, but that did not work either. Besides, that did not seem like the way it should work, since the PlaintextCorpusReader does that already. The paras and sents functions are a part of the PlaintextCorpusReader. Is there a particular fileid I need to enter? Or is there some sort of regex argument it requires to find the sentences or paragraphs? The documentation and source code do not seem to say it needs anything more than the words function does.
Upvotes: 1
Views: 2067
Reputation: 50220
You're missing a data file ("resource") needed by the sentence tokenizer. Fix the problem by downloading the "punkt" resource under "Models" in the interactive downloader, or non-interactively by running this code once:
nltk.download("punkt")
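If you would rather not trigger a download every time the script starts, one option is to check for the resource first. The following is only a minimal sketch: the helper name ensure_nltk_resource is made up for illustration, and it assumes the package name you pass matches the resource named in your error message (some newer NLTK releases may ask for a differently named package).

```python
import nltk


def ensure_nltk_resource(resource_path, package):
    """Download `package` only if `resource_path` is not already installed.

    Hypothetical helper, not part of NLTK. Returns True if a download
    was attempted, False if the resource was already present.
    """
    try:
        # nltk.data.find raises LookupError when the resource is missing
        # from every directory on nltk.data.path.
        nltk.data.find(resource_path)
    except LookupError:
        nltk.download(package)
        return True
    return False
```

Calling ensure_nltk_resource("tokenizers/punkt", "punkt") once before you use sents() or paras() should then be enough; on later runs it finds the data locally and does nothing.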
To avoid running into this kind of problem repeatedly as you explore NLTK, I recommend downloading the "book" bundle now. It contains everything you're likely to need for a while.
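As for whether sents() and paras() need an extra argument: they don't, but unlike words() they rely on a sentence tokenizer, which is why only they hit the missing resource. PlaintextCorpusReader accepts a sent_tokenizer argument, so you can swap in your own if you want to avoid the punkt data entirely. Here is a rough sketch; the regex-based splitter and the read_sentences helper are my own stand-ins for illustration, and the splitter is far cruder than punkt.

```python
import os
import tempfile

from nltk.corpus import PlaintextCorpusReader
from nltk.tokenize import RegexpTokenizer

# Assumption for illustration: a "sentence" is any run of text that ends
# in '.', '!' or '?'. This needs no downloaded data, unlike punkt.
simple_sent_tokenizer = RegexpTokenizer(r"[^.!?]+[.!?]")


def read_sentences(corpus_root, fileid):
    """Return (sentences, paragraphs) for one file using the custom splitter."""
    reader = PlaintextCorpusReader(
        corpus_root, r".*\.txt", sent_tokenizer=simple_sent_tokenizer
    )
    # Materialize the lazy corpus views while the files still exist.
    return list(reader.sents(fileid)), list(reader.paras(fileid))


# Small demo on a throwaway corpus directory.
with tempfile.TemporaryDirectory() as corpus_root:
    path = os.path.join(corpus_root, "file1.txt")
    with open(path, "w", encoding="utf8") as f:
        f.write("Hello world. How are you?\n\nSecond paragraph here.\n")
    sentences, paragraphs = read_sentences(corpus_root, "file1.txt")

print(sentences)
print(paragraphs)
```

Keep in mind that a splitter like this breaks on abbreviations such as "Dr." or on decimal numbers, so for real text downloading punkt is still the better choice.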
Upvotes: 5