nucky

Reputation: 348

Python with NLTK shows error at sent_tokenize and word_tokenize

I am using Google Colab to work through a script from a video tutorial. Unfortunately, I get an error even though I followed the video's instructions.

sentences_tokens=nltk.sent_tokenize(allParagraphContent_cleanedData)
words_tokens=nltk.word_tokenize(allParagraphContent_cleanedData)

causes a problem -- both lines. I have tried each of them on its own, in Python 3 (which I mainly use). Here are the imported libraries:

from urllib import request
from bs4 import BeautifulSoup as bs
import re
import nltk
import heapq

The error I get is:

---------------------------------------------------------------------------

LookupError                               Traceback (most recent call last)

<ipython-input-13-2467ae276de5> in <module>()
     26 allParagraphContent_cleanedData=re.sub(r'\s+','',allParagraphContent_cleanedData)
     27 
---> 28 sentences_tokens=nltk.sent_tokenize(allParagraphContent_cleanedData)
     29 words_tokens=nltk.word_tokenize(allParagraphContent_cleanedData)
     30 

Frankly, I don't understand the error.

What is the problem that I'm not seeing?

-- Here's the whole code

from urllib import request
from bs4 import BeautifulSoup as bs
import re
import nltk
import heapq



url="https://en.wikipedia.org/wiki/Machine_learning"
allParagraphContent = ""
htmlDoc=request.urlopen(url)
soupObject=bs(htmlDoc,'html.parser')



paragraphContents = soupObject.find_all('p')  # collect all <p> elements; the loop below needs this definition

for paragraphContent in paragraphContents:
    allParagraphContent += paragraphContent.text


allParagraphContent_cleanerData=re.sub(r'\[[0-9]*\]','',allParagraphContent)  # strip citation markers like [1]
allParagraphContent_cleanedData=re.sub(r'\s+',' ',allParagraphContent_cleanerData)  # collapse whitespace to single spaces



allParagraphContent_cleanedData=re.sub(r'[^a-zA-Z]',' ',allParagraphContent_cleanedData)  # replace non-letters with spaces
allParagraphContent_cleanedData=re.sub(r'\s+',' ',allParagraphContent_cleanedData)

sentences_tokens=nltk.sent_tokenize(allParagraphContent_cleanedData)
words_tokens=nltk.word_tokenize(allParagraphContent_cleanedData)

The solution: adding nltk.download("popular") after import nltk.

Upvotes: 1

Views: 4991

Answers (1)

Celius Stingher

Reputation: 18367

This error usually appears when a required NLTK data package is missing (here, the punkt tokenizer models that sent_tokenize and word_tokenize rely on). It can be solved by calling the download() method and specifying the missing package. Alternatively, you can pass 'all' and simply download everything. The code would be:

nltk.download('all')

Upvotes: 4
