Kailin Huang

Reputation: 45

error using "nltk.word_tokenize()" function

I'm trying to tokenize Twitter text. When I apply nltk.word_tokenize() to each individual tweet, it works perfectly even for some very ugly ones such as

'\xd8\xb3\xd8\xa3\xd9\x87\xd9\x8e\xd9\x85\xd9\x90\xd8\xb3\xd9\x8f',
'\xd9\x82\xd9\x90\xd8\xb5\xd9\x8e\xd9\x91\xd8\xa9\xd9\x8b', '\xd8\xad\xd8\xaa\xd9\x89'

but when I loop through all the tweets in a file

import nltk

tokens = []
for i in range(0, 5047591):
    s = ','.join(l_of_l[i])
    tokens += nltk.word_tokenize(s)

it raises errors such as:

File "/Library/Python/2.7/site-packages/nltk/tokenize/punkt.py", line 1304, in _realign_boundaries
    for sl1, sl2 in _pair_iter(slices):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2: ordinal not in range(128)

and many more

Any suggestions on how to fix this?

Upvotes: 2

Views: 2502

Answers (1)

Leb

Reputation: 15953

The problem you're getting is not from the code you included; it comes from the code that contains the open() call. The script opens the file fine, but the data is read as raw bytes, and when NLTK tries to process those bytes it falls back to the default ASCII codec, which produces that traceback. Open the file with an explicit UTF-8 encoding instead:

import codecs
...
# decode the file as UTF-8 so the tokenizer receives unicode text
with codecs.open('file.csv', 'r', encoding='utf8') as f:
    text = f.read()
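
Putting the two pieces together, here is a minimal sketch of the whole pipeline. The file name file.csv comes from the snippet above, and the sketch assumes each line is plain comma-separated text with no quoted fields (the csv module would be more robust for real CSV data):

import codecs
import nltk

tokens = []
# read and decode the whole file as UTF-8 up front
with codecs.open('file.csv', 'r', encoding='utf8') as f:
    l_of_l = [line.strip().split(',') for line in f]

for row in l_of_l:
    s = ','.join(row)  # rebuild the tweet text from the row
    tokens += nltk.word_tokenize(s)

Decoding once at read time means every string that reaches nltk.word_tokenize() is already unicode, so the implicit ASCII decode that caused the error never happens.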

Upvotes: 2
