Outcast

Reputation: 5117

UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 118374: ordinal not in range(128)

I am experimenting with some NLP algorithms and am currently focusing on sentiment analysis. To that end, I downloaded some files in .review format, containing positive and negative reviews, from http://www.cs.jhu.edu/~mdredze/datasets/sentiment/index2.html.

I am using BeautifulSoup to parse these XML files. For now I am only trying to read them with the following code:

from bs4 import BeautifulSoup

positive_reviews = BeautifulSoup(open('*******/electronics/positive.review').read())
positive_reviews = positive_reviews.findAll('review_text')

negative_reviews = BeautifulSoup(open('*******/electronics/negative.review').read())
negative_reviews = negative_reviews.findAll('review_text')

However, I am getting the following error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 118374: ordinal not in range(128)

It is raised when the following line is executed:

positive_reviews = BeautifulSoup(open('*******/electronics/positive.review').read())

How can I fix this error?

I have also tried replacing

BeautifulSoup(open('*******/electronics/positive.review').read())

with

BeautifulSoup(open('*******/electronics/positive.review').read().decode('utf-8'))

but I am getting exactly the same error.

Finally, I have already read some relevant posts on StackOverflow, but so far nothing has worked for me. For example, in my terminal, echo $LANG outputs en_GB.UTF-8, as described in the first answer to "UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 1", but I am still getting the error above.

Upvotes: 0

Views: 711

Answers (1)

kristaps

Reputation: 1723

If you're using Python 3, try replacing

open('*******/electronics/positive.review')

with

open('*******/electronics/positive.review', encoding='utf-8')
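
Some background on why this may help: in Python 3, open() in text mode decodes the file with the locale's preferred encoding when no encoding is given. If that resolves to ASCII in the environment where the script runs, read() fails on the first non-ASCII byte, before any later .decode('utf-8') call is reached, which would explain why that attempt produced the same error. Below is a minimal sketch of the explicit-encoding approach, assuming the review files really are UTF-8. The helper name read_text_utf8 and the throwaway sample file are hypothetical and only there so the snippet runs on its own; in the real script, the returned string would be passed to BeautifulSoup instead.

```python
import os
import tempfile


def read_text_utf8(path):
    """Read a whole file as text, decoding it explicitly as UTF-8.

    Hypothetical helper for illustration; it assumes the file is UTF-8.
    """
    # Passing encoding= means the locale's default codec is never consulted.
    with open(path, encoding="utf-8") as handle:
        return handle.read()


# Throwaway sample standing in for a .review file (an assumption for this demo).
# U+FFFD is encoded in UTF-8 as the bytes EF BF BD, so the file contains a 0xef
# byte, like the one reported in the error message.
sample = "<review_text>caf\u00e9 \ufffd great</review_text>"

with tempfile.NamedTemporaryFile("wb", suffix=".review", delete=False) as tmp:
    tmp.write(sample.encode("utf-8"))
    demo_path = tmp.name

try:
    text = read_text_utf8(demo_path)
finally:
    # Clean up the temporary file regardless of whether reading succeeded.
    os.remove(demo_path)
```

If the files turn out not to be valid UTF-8, this will still raise a UnicodeDecodeError, but one that names the 'utf-8' codec, which at least narrows down what the actual encoding might be.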

Upvotes: 1
