Reputation: 45
I'm trying to tokenize Twitter text. When I apply nltk.word_tokenize() to each individual tweet, it works perfectly, even for some very ugly ones such as
'\xd8\xb3\xd8\xa3\xd9\x87\xd9\x8e\xd9\x85\xd9\x90\xd8\xb3\xd9\x8f',
'\xd9\x82\xd9\x90\xd8\xb5\xd9\x8e\xd9\x91\xd8\xa9\xd9\x8b', '\xd8\xad\xd8\xaa\xd9\x89'
but when I loop through all the tweets in a file:
tokens = []
for i in range(0, 5047591):
    s = ','.join(l_of_l[i])
    tokens += nltk.word_tokenize(s)
it fails with errors such as:
File "/Library/Python/2.7/site-packages/nltk/tokenize/punkt.py", line 1304, in _realign_boundaries
    for sl1, sl2 in _pair_iter(slices):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2: ordinal not in range(128)
and many more
Any suggestions on how to fix this?
Upvotes: 2
Views: 2502
Reputation: 15953
The problem isn't coming from the code you included; it's coming from the code that opens the file. The script is opening the file fine, but when you access your data, the raw bytes are being decoded with Python 2's default ascii codec, and that is what produces the traceback.
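You can reproduce the same class of error without NLTK at all (a minimal sketch; the byte values here are illustrative, not taken from your file):

# Python 2: combining a unicode string with non-ASCII bytes forces an
# implicit decode with the ascii codec, which is exactly what fails
u','.join([u'tweet one', '\xc3\xa9'])
# UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0

The fix is to decode the bytes yourself at read time: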
import codecs
...
with codecs.open('file.csv', 'r', encoding='utf8') as f:
    text = f.read()  # text is now a unicode object, not raw bytes
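Applied to your loop, it would look something like this (a sketch: the filename and the assumption that each row of l_of_l corresponds to one line of the CSV are mine, so adjust them to your data):

import codecs
import nltk

tokens = []
with codecs.open('file.csv', 'r', encoding='utf8') as f:
    for line in f:
        # each line is already unicode, so word_tokenize() never has to
        # guess the encoding and the UnicodeDecodeError goes away
        tokens += nltk.word_tokenize(line.strip())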
Upvotes: 2