Reputation: 647
I'm trying to tokenize some documents, but I get this error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 6: ordinal not in range(128)
import nltk
import pandas as pd
df = pd.DataFrame(pd.read_csv('status2.csv'))
documents = df['status']
result = [nltk.word_tokenize(sent) for sent in documents]
I thought it was a Unicode problem, so I added
documents = unicode(documents, 'utf-8')
but that gives another error:
TypeError: coercing to Unicode: need string or buffer, Series found
print documents
1 Brandon Cachia ,All I know is that,you're so n...
2 Melissa Zejtunija:HAM AND CHEESE BIEX INI??? *...
3 .........Where is my mind?????
4 Having a philosophical discussion with Trudy D...
Upvotes: 3
Views: 286
Reputation: 2153
unicode operates on a string or buffer, but documents is a pandas Series, which is why you get the TypeError. Decode each string in the Series individually instead. Maybe:
result = [nltk.word_tokenize(unicode(sent, 'utf-8')) for sent in documents]
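A fuller sketch of the same idea, assuming Python 2 and that status2.csv is UTF-8 encoded (both assumptions, since the question doesn't say):

# -*- coding: utf-8 -*-
import nltk
import pandas as pd

# read_csv already returns a DataFrame, so no pd.DataFrame(...) wrapper is needed
df = pd.read_csv('status2.csv')
documents = df['status']

# decode each byte string from UTF-8 before handing it to the tokenizer,
# so word_tokenize receives unicode text rather than raw bytes
result = [nltk.word_tokenize(unicode(sent, 'utf-8')) for sent in documents]
print result[:2]

On Python 3 the decode step goes away entirely, since strings are Unicode by default: your original result = [nltk.word_tokenize(sent) for sent in documents] works as-is once the file is read with the right encoding, e.g. pd.read_csv('status2.csv', encoding='utf-8').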
Upvotes: 2