Reputation: 31
Here is my code, which just performs some tokenization with NLTK.
import nltk
from nltk.corpus import stopwords
tokens = nltk.word_tokenize(doc, language='english')
# remove all the stopwords and any non-alphanumeric tokens
stop_words = set(stopwords.words('english'))  # build the set once, not once per token
filtered = [w for w in tokens if w not in stop_words and w.isalnum()]
I've already downloaded the punkt package. I also tried copying the folder into the locations that the error message says it searched. Here is the error, which I have also seen in other similar questions.
Resource u'tokenizers/punkt/english.pickle' not found.
Please use the NLTK Downloader to obtain the resource: >>> nltk.download()
Searched in:
- '/root/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
- u''
I even tried reinstalling NLTK and its packages, but it didn't work. Useful information about the environment:
- run through the terminal of the PyCharm IDE
- operating system: Ubuntu 15
- nltk installed using pip
- nltk_data installed in the default location /home/user/nltk_data
Please don't tell me to use nltk.download('punkt'), because I already have it. Thanks for your help.
Upvotes: 3
Views: 5155
Reputation: 68
If you're running this in a distributed environment, you'll have to distribute the NLTK data files to each node. Here's how you would do it in a Spark environment:
sc.addFile('/tmp/nltk_data/tokenizers/punkt/PY3/english.pickle')
Upvotes: 0
Reputation: 4382
You have to install the NLTK punkt package to tokenize.
How? Run the python command to enter the Python interpreter, then:
import nltk
nltk.download('punkt')
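If the download reports success but the error persists, it may help to check whether the file actually exists in any of the directories NLTK searched. Below is a minimal sketch using only the standard library. The helper name find_resource is hypothetical, and the paths are copied from the question, so adjust them to your machine.

```python
import os

# Directories copied from the "Searched in" list of the error message.
SEARCHED_DIRS = [
    "/root/nltk_data",
    "/usr/share/nltk_data",
    "/usr/local/share/nltk_data",
    "/usr/lib/nltk_data",
    "/usr/local/lib/nltk_data",
]
# Location where the question says nltk_data was installed.
INSTALLED_DIR = "/home/user/nltk_data"

# Relative path of the resource named in the error message.
RESOURCE = os.path.join("tokenizers", "punkt", "english.pickle")


def find_resource(dirs, resource=RESOURCE):
    """Return the directories in `dirs` that contain `resource` (hypothetical helper)."""
    # os.path.isfile returns False for missing or unreadable paths instead of raising.
    return [d for d in dirs if os.path.isfile(os.path.join(d, resource))]


if __name__ == "__main__":
    print("found in searched dirs:", find_resource(SEARCHED_DIRS))
    print("found in install dir:", find_resource([INSTALLED_DIR]))
```

If the file shows up only under the install directory, the script is probably running as a different user than the one who downloaded the data (the /root/nltk_data entry suggests root). In that case, one option is to call nltk.data.path.append('/home/user/nltk_data') before tokenizing, or to set the NLTK_DATA environment variable to that directory.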
Upvotes: 3