Reputation: 45
I'm trying to tokenize Twitter text. When I apply nltk.word_tokenize() to each individual tweet, it works perfectly, even for some very ugly ones such as
'\xd8\xb3\xd8\xa3\xd9\x87\xd9\x8e\xd9\x85\xd9\x90\xd8\xb3\xd9\x8f',
'\xd9\x82\xd9\x90\xd8\xb5\xd9\x8e\xd9\x91\xd8\xa9\xd9\x8b', '\xd8\xad\xd8\xaa\xd9\x89'
but when I loop through all the tweets in a file:
tokens = []
for i in range(0, 5047591):
    s = ','.join(l_of_l[i])
    tokens += nltk.word_tokenize(s)
it fails with errors such as:
File "/Library/Python/2.7/site-packages/nltk/tokenize/punkt.py", line 1304, in _realign_boundaries
    for sl1, sl2 in _pair_iter(slices):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2: ordinal not in range(128)
and many more
Any suggestions on how to fix this?
Upvotes: 2
Views: 2502
Reputation: 15953
The problem isn't coming from the code you included; it's coming from the code that opens the file. The script is opening the file fine, but when you access your data, the raw bytes are being decoded with Python 2's default ascii codec, and that is what produces the traceback.
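You can reproduce the same class of error without NLTK at all (a minimal sketch; the byte values here are illustrative, not taken from your file):

# Python 2: combining a unicode string with non-ASCII bytes forces an
# implicit decode with the ascii codec, which is exactly what fails
u','.join([u'tweet one', '\xc3\xa9'])
# UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0

The fix is to decode the bytes yourself at read time: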
import codecs
...
with codecs.open('file.csv', 'r', encoding='utf8') as f:
    text = f.read()  # text is now a unicode object, not raw bytes
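Applied to your loop, it would look something like this (a sketch: the filename and the assumption that each row of l_of_l corresponds to one line of the CSV are mine, so adjust them to your data):

import codecs
import nltk

tokens = []
with codecs.open('file.csv', 'r', encoding='utf8') as f:
    for line in f:
        # each line is already unicode, so word_tokenize() never has to
        # guess the encoding and the UnicodeDecodeError goes away
        tokens += nltk.word_tokenize(line.strip())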
Upvotes: 2