Tom

Reputation: 131

Error using nltk word_tokenize

I am doing some exercises from the NLTK book on accessing text from the web and from disk (Chapter 3). When I call word_tokenize, I get an error.

This is my code:

>>> import nltk
>>> from urllib.request import urlopen
>>> url = "http://www.gutenberg.org/files/2554/2554.txt"
>>> raw = urlopen(url).read()
>>> tokens = nltk.word_tokenize(raw)

And this is the traceback:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\u0084411\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\tokenize\__init__.py", line 109, in word_tokenize
    return [token for sent in sent_tokenize(text, language)
  File "C:\Users\u0084411\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\tokenize\__init__.py", line 94, in sent_tokenize
    return tokenizer.tokenize(text)
  File "C:\Users\u0084411\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py", line 1237, in tokenize
    return list(self.sentences_from_text(text, realign_boundaries))
  File "C:\Users\u0084411\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py", line 1285, in sentences_from_text
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "C:\Users\u0084411\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py", line 1276, in span_tokenize
    return [(sl.start, sl.stop) for sl in slices]
  File "C:\Users\u0084411\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py", line 1276, in <listcomp>
    return [(sl.start, sl.stop) for sl in slices]
  File "C:\Users\u0084411\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py", line 1316, in _realign_boundaries
    for sl1, sl2 in _pair_iter(slices):
  File "C:\Users\u0084411\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py", line 310, in _pair_iter
    prev = next(it)
  File "C:\Users\u0084411\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py", line 1289, in _slices_from_text
    for match in self._lang_vars.period_context_re().finditer(text):
TypeError: cannot use a string pattern on a bytes-like object

Can someone please explain to me what is going on here and why I cannot seem to use word_tokenize properly?

Many thanks!

Upvotes: 4

Views: 8379

Answers (2)

Vaibhav Gaware

Reputation: 31

I was getting a 404 error for that URL, so I changed it. The following works for me; you can change the URL to the one below, and it may work for you as well.

from urllib import request
url = "https://ia803405.us.archive.org/21/items/crimeandpunishme02554gut/2554.txt"
raw = request.urlopen(url).read()
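
Note that request.urlopen(url).read() still returns a bytes object, so the decoding step from the other answer likely still applies before tokenizing. A minimal sketch, assuming the file is UTF-8 encoded:

from urllib import request
import nltk

url = "https://ia803405.us.archive.org/21/items/crimeandpunishme02554gut/2554.txt"
raw = request.urlopen(url).read().decode('utf-8')  # decode bytes -> str before tokenizing
tokens = nltk.word_tokenize(raw)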

Upvotes: 0

Dmitry

Reputation: 2096

You have to convert the response (which urlopen(...).read() returns as a bytes object) into a string using decode('utf-8'):

>>> import nltk
>>> from urllib.request import urlopen
>>> url = "http://www.gutenberg.org/files/2554/2554.txt"
>>> raw = urlopen(url).read()
>>> raw = raw.decode('utf-8')
>>> tokens = nltk.word_tokenize(raw)
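
If you prefer not to hard-code UTF-8, one possible variant (an assumption, not part of the original answer) reads the charset from the HTTP response headers and falls back to UTF-8 when none is advertised:

>>> response = urlopen(url)
>>> # charset declared by the server, or utf-8 as a fallback
>>> encoding = response.headers.get_content_charset() or 'utf-8'
>>> raw = response.read().decode(encoding)
>>> tokens = nltk.word_tokenize(raw)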

Upvotes: 4
