Reputation: 348
I am using Google Colab to work through a script from a video tutorial. Unfortunately, I get an error even though I followed the video's instructions.
sentences_tokens=nltk.sent_tokenize(allParagraphContent_cleanedData)
words_tokens=nltk.word_tokenize(allParagraphContent_cleanedData)
Both of these lines raise an error. I have tried each of them on its own, in Python 3 (which I mainly use). Here are the imported libraries:
from urllib import request
from bs4 import BeautifulSoup as bs
import re
import nltk
import heapq
The error I get is:
---------------------------------------------------------------------------
LookupError Traceback (most recent call last)
<ipython-input-13-2467ae276de5> in <module>()
26 allParagraphContent_cleanedData=re.sub(r'\s+','',allParagraphContent_cleanedData)
27
---> 28 sentences_tokens=nltk.sent_tokenize(allParagraphContent_cleanedData)
29 words_tokens=nltk.word_tokenize(allParagraphContent_cleanedData)
30
Frankly, I don't understand the error.
What am I missing?
Here's the whole code:
from urllib import request
from bs4 import BeautifulSoup as bs
import re
import nltk
import heapq
url="https://en.wikipedia.org/wiki/Machine_learning"
allParagraphContent = ""
# Fetch and parse the page
htmlDoc=request.urlopen(url)
soupObject=bs(htmlDoc,'html.parser')
# Collect the text of every <p> element
paragraphContents=soupObject.find_all('p')
for paragraphContent in paragraphContents:
    allParagraphContent += paragraphContent.text
# Strip citation markers such as [12], then whitespace and non-letters
allParagraphContent_cleanerData=re.sub(r'\[[0-9]*\]','',allParagraphContent)
allParagraphContent_cleanedData=re.sub(r'\s+','',allParagraphContent_cleanerData)
allParagraphContent_cleanedData=re.sub(r'[^a-zA-Z]','',allParagraphContent_cleanedData)
allParagraphContent_cleanedData=re.sub(r'\s+','',allParagraphContent_cleanedData)
# Tokenize (both calls need the NLTK tokenizer data to be available)
sentences_tokens=nltk.sent_tokenize(allParagraphContent_cleanedData)
words_tokens=nltk.word_tokenize(allParagraphContent_cleanedData)
The solution:
Add nltk.download("popular") right after import nltk.
Upvotes: 1
Views: 4991
Reputation: 18367
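This error usually appears when an NLTK data package is missing, here most likely the Punkt tokenizer models that sent_tokenize and word_tokenize rely on. It can be solved by calling the nltk.download()
function and specifying the package. Alternatively, you can pass 'all'
and just download everything, which is simpler but a considerably larger download. The code would be: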
nltk.download('all')
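If downloading everything is more than you need, something along these lines should fetch only the resource that is missing. This is a minimal sketch, not an NLTK feature: ensure_nltk_resource is a hypothetical helper name, and the find and download parameters exist only so the logic can be tried without a network connection. It relies on nltk.data.find raising LookupError when the data is absent, which is the same error shown in the question.

```python
def ensure_nltk_resource(resource_path, package, find=None, download=None):
    """Download `package` only if `resource_path` cannot be found locally.

    Hypothetical helper, not part of NLTK. `find` and `download` default to
    nltk.data.find and nltk.download; they are parameters only so stand-ins
    can be passed in. Returns True if a download was triggered, else False.
    """
    if find is None or download is None:
        import nltk  # imported lazily so stand-ins work without NLTK installed
        find = find or nltk.data.find
        download = download or nltk.download
    try:
        find(resource_path)  # raises LookupError when the data is missing
        return False
    except LookupError:
        download(package)
        return True
```

In the notebook you would call it once after import nltk, for example ensure_nltk_resource("tokenizers/punkt", "punkt"). The exact names are an assumption about your setup: newer NLTK releases may ask for "punkt_tab" instead of "punkt", and the full LookupError message names the resource it wants.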
Upvotes: 4