LEONARDO MIGLIORINI

Reputation: 31

Python nltk: Resource u'tokenizers/punkt/english.pickle' not found, but it is actually present

Here is my code, just performing some tokenization with nltk.

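import nltk
from nltk.corpus import stopwords

# doc holds the text to tokenize (a str)
tokens = nltk.word_tokenize(doc, language='english')
# remove all the stopwords and non-alphanumeric tokens
stop_words = set(stopwords.words('english'))  # build the set once, not once per token
filtered = [w for w in tokens if w not in stop_words and w.isalnum()]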

I have already downloaded the punkt package. I also tried copying the punkt folder into each of the locations the error message says it searched. Here is the error, which I have also seen in other similar questions:

Resource u'tokenizers/punkt/english.pickle' not found.
Please use the NLTK Downloader to obtain the resource: >>> nltk.download()
Searched in:

- '/root/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
- u''

I even tried reinstalling nltk and its data packages, but that didn't work. Useful information about the environment:

- run through the terminal of the PyCharm IDE
- operating system: Ubuntu 15
- nltk installed using pip
- nltk_data installed in the default location /home/user/nltk_data

Please don't tell me to use nltk.download('punkt'), because I already have it. Thanks for your help.

Upvotes: 3

Views: 5155

Answers (2)

Zcauchon

Reputation: 68

If you're running this in a distributed environment, you'll have to distribute the NLTK data files to every node. Here's how you would do it in a Spark environment:

 sc.addFile('/tmp/nltk_data/tokenizers/punkt/PY3/english.pickle')

Upvotes: 0

tremendows

Reputation: 4382

You have to install the punkt tokenizer data before you can tokenize.

  • How?

    1. Open a Terminal.
    2. Run the python command to start the Python interpreter.
    3. Execute import nltk
    4. Execute nltk.download('punkt')
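
The same steps can also be run from a script. This is only a rough sketch: `dirs_containing` and `download_and_locate_punkt` are made-up helper names, not part of nltk. It assumes nltk is importable and that `nltk.data.path` is the list of directories printed under "Searched in:" in the error. After downloading, it reports which of those directories actually contain `tokenizers/punkt`.

```python
import os


def dirs_containing(search_dirs, resource):
    """Return the entries of search_dirs under which `resource` exists."""
    return [d for d in search_dirs
            if d and os.path.exists(os.path.join(d, resource))]


def download_and_locate_punkt():
    """Steps 3 and 4 above, plus a check of where punkt is visible."""
    import nltk  # imported lazily so dirs_containing() works without nltk

    nltk.download('punkt')
    # Assumption: nltk.data.path is the search list shown in the error message.
    return dirs_containing(nltk.data.path, os.path.join('tokenizers', 'punkt'))
```

If `download_and_locate_punkt()` returns an empty list when called the same way the failing script is run, the interpreter is probably searching different directories from the one the data was downloaded to. In the question, for example, the search list starts at /root/nltk_data while the data sits in /home/user/nltk_data.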


Upvotes: 3
