heroZero

Reputation: 173

UnicodeDecodeError with nltk

I am working with Python 2.7 and NLTK on a large .txt file of content scraped from various websites. However, I am getting Unicode errors such as:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 6: ordinal not in range(128)

My question is not so much how I can 'fix' this in Python, but whether there is anything I can do to the .txt file itself (some formatting step such as 'make plain text') before feeding it to Python, so that the issue is avoided entirely.

Update:

I looked around and found a solution within Python that seems to work perfectly:

import sys
reload(sys)
sys.setdefaultencoding('utf8')
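For context, byte 0xe2 is the first byte of the UTF-8 encoding of characters such as the right single quotation mark (U+2019), which is common in scraped web text. The sketch below is only an illustration: the sample bytes are invented, and it assumes the scraped file is UTF-8. It reproduces the same error message and shows the bytes decoding cleanly once the codec is named explicitly, which is an alternative to changing the interpreter-wide default.

```python
# Hypothetical sample: "It isn't plain ASCII" with a curly apostrophe,
# stored as UTF-8 bytes (the apostrophe is the three bytes e2 80 99).
raw = b"It isn\xe2\x80\x99t plain ASCII"

# Decoding as ASCII fails on the first non-ASCII byte.
try:
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    error_message = str(exc)

# Decoding with the codec the bytes were actually written in succeeds.
text = raw.decode("utf-8")

print(error_message)
```

Decoding at the point where the bytes enter the program keeps the rest of the code working with Unicode text only.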

Upvotes: 1

Views: 70

Answers (1)

Try opening the file with an explicit encoding:

f = open(fname, encoding="ascii", errors="surrogateescape")

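Replace "ascii" with the file's actual encoding (for example "utf-8"). Note that the encoding and errors arguments of the built-in open() exist only in Python 3. On Python 2.7, io.open accepts both arguments, but the "surrogateescape" error handler is not available there.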
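As a rough sketch of how this could look on Python 2.7 (it also runs on Python 3), the code below reads a file through io.open and then produces an ASCII-only version of the text, which is close to the 'make plain text' step the question asks about. The helper names read_text and to_ascii and the PUNCTUATION table are hypothetical, and UTF-8 is an assumed encoding for the scraped file.

```python
import io

# Hypothetical mapping of common "smart" punctuation to ASCII equivalents.
PUNCTUATION = {
    u"\u2018": u"'", u"\u2019": u"'",   # single curly quotes
    u"\u201c": u'"', u"\u201d": u'"',   # double curly quotes
    u"\u2013": u"-", u"\u2014": u"-",   # en and em dashes
    u"\u2026": u"...",                  # ellipsis
}


def read_text(path, encoding="utf-8"):
    """Read a file as Unicode text; undecodable bytes become U+FFFD."""
    with io.open(path, encoding=encoding, errors="replace") as handle:
        return handle.read()


def to_ascii(text):
    """Swap known punctuation for ASCII and drop any other non-ASCII."""
    for fancy, plain in PUNCTUATION.items():
        text = text.replace(fancy, plain)
    return text.encode("ascii", "ignore").decode("ascii")
```

Dropping the remaining non-ASCII characters loses information (accented letters, for instance), so this only suits cases where plain ASCII is really what the downstream processing needs.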

Upvotes: 1
