byoungdale

Reputation: 181

nltk PlaintextCorpusReader sents and paras functions not working

I cannot get the paras and sents functions in PlaintextCorpusReader to work. Here is the code I have:

import nltk
from nltk.corpus import PlaintextCorpusReader

corpus_root = './dir_root'
newcorpus = PlaintextCorpusReader(corpus_root, '.*') # Files you want to add

word_list = newcorpus.words('file1.txt')
sentence_list = newcorpus.sents('file1.txt')
paragraph_list = newcorpus.paras('file1.txt')

print(word_list)
print(sentence_list)
print(paragraph_list)

word_list comes out fine.

['__________________________________________________________________', 'Title', ...]

But, paragraph_list and sentence_list both give this error:

Traceback (most recent call last):
  File "corpus.py", line 13, in <module>
    print(sentence_list)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/collections.py", line 225, in __repr__
    for elt in self:
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/corpus/reader/util.py", line 296, in iterate_from
    tokens = self.read_block(self._stream)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/corpus/reader/plaintext.py", line 129, in _read_sent_block
    for sent in self._sent_tokenizer.tokenize(para)])
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/data.py", line 956, in __getattr__
    self.__load()
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/data.py", line 948, in __load
    resource = load(self._path)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/data.py", line 808, in load
    opened_resource = _open(resource_url)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/data.py", line 926, in _open
    return find(path_, path + ['']).open()
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/data.py", line 648, in find
    raise LookupError(resource_not_found)
LookupError:
**********************************************************************
  Resource 'tokenizers/punkt/PY3/english.pickle' not found.
  Please use the NLTK Downloader to obtain the resource:  >>>
  nltk.download()
  Searched in:
    - '/Users/username/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - ''
**********************************************************************

I tried using nltk.download() to download the file into the corpus, but that did not work either. Plus, it did not seem like the way it should work, since PlaintextCorpusReader does that already. The paras and sents functions are a part of PlaintextCorpusReader. Is there a particular fileid I need to enter? Or is there some sort of regex argument it requires to find the sentences or paragraphs? The documentation and source code do not seem to say that they need anything more than the words function does.

Upvotes: 1

Views: 2067

Answers (1)

alexis

Reputation: 50220

You're missing a data file ("resource") needed by the sentence tokenizer. Fix the problem by downloading the "punkt" resource under "Models" in the interactive downloader, or non-interactively by running this code once:

import nltk
nltk.download("punkt")  # one-time download of the sentence tokenizer models

To avoid running into this kind of problem repeatedly as you explore the nltk, I recommend downloading the "book" bundle now. It contains everything you're likely to need for a while.
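If you would rather not call the downloader every time the script runs, a check along the following lines may help. This is only a sketch: `ensure_resource` is a hypothetical helper name, not part of NLTK. It relies on `nltk.data.find`, which raises the same `LookupError` shown in the traceback when a resource is missing, and on `nltk.download`, which accepts a package or collection identifier such as "punkt" or "book".

```python
import nltk


def ensure_resource(resource_path, package_id):
    """Download package_id only if resource_path is not installed yet.

    ensure_resource is a hypothetical helper, not an NLTK function.
    Returns True if a download was triggered, False otherwise.
    """
    try:
        # Raises LookupError if the resource is not on nltk.data.path.
        nltk.data.find(resource_path)
        return False
    except LookupError:
        nltk.download(package_id)
        return True


if __name__ == "__main__":
    # The sentence tokenizer models that sents() and paras() load lazily.
    ensure_resource("tokenizers/punkt", "punkt")
    # The "book" bundle could be fetched with nltk.download("book") instead.
```

The words function works without this because it does not load the tokenizer data file. If your NLTK version names a different resource in the error message, pass that name instead of "punkt".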

Upvotes: 5
