Istvan

Reputation: 8562

How to tokenize unicode text with nltk?

I am trying to load a csv into a DataFrame and using it for NLP. I am getting a UnicodeDecodeError:

import pandas as pd
import nltk
df = pd.read_csv('1459966468_324.csv')
df['tokenized_sents'] = df.apply(lambda row: nltk.word_tokenize(row['sentences']), axis=1)


UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 29: ordinal not in range(128)

Is there a way to process Unicode text with nltk?

Upvotes: 0

Views: 687

Answers (1)

Zeugma

Reputation: 32095

Use the encoding argument to tell pandas how to parse the file:

pd.read_csv('1459966468_324.csv', encoding='utf8')
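A minimal, self-contained sketch of the fix (the CSV content below is hypothetical, standing in for your file; `io.BytesIO` plays the role of the file on disk). The byte `0xe2` from your traceback is the first byte of a UTF-8-encoded character such as a curly apostrophe, which is why the ASCII codec chokes on it:

```python
import io
import pandas as pd

# Hypothetical CSV content with a non-ASCII character (U+2019, a curly
# apostrophe, whose UTF-8 encoding begins with byte 0xe2 -- the byte
# from the traceback).
csv_bytes = u'sentences\nIt\u2019s a test sentence.\n'.encode('utf-8')

# encoding='utf8' tells pandas how to decode the raw bytes of the file;
# io.BytesIO stands in for the real file here.
df = pd.read_csv(io.BytesIO(csv_bytes), encoding='utf8')

# The column now holds proper unicode strings, so a tokenizer such as
# nltk.word_tokenize can process it without a UnicodeDecodeError.
print(df['sentences'][0])
```

Once the column contains decoded unicode strings rather than raw bytes, your original `df.apply(lambda row: nltk.word_tokenize(row['sentences']), axis=1)` call should work unchanged.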

Upvotes: 1

Related Questions