Reputation: 5117
I am experimenting with some NLP algorithms and am currently focusing on sentiment analysis. To that end, I downloaded some positive and negative reviews in .review format from http://www.cs.jhu.edu/~mdredze/datasets/sentiment/index2.html.
I am using BeautifulSoup to parse these XML files, and for now I am only trying to read them with the following code:
from bs4 import BeautifulSoup
positive_reviews = BeautifulSoup(open('*******/electronics/positive.review').read())
positive_reviews = positive_reviews.findAll('review_text')
negative_reviews = BeautifulSoup(open('*******/electronics/negative.review').read())
negative_reviews = negative_reviews.findAll('review_text')
However, I am getting the following error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 118374: ordinal not in range(128)
It is raised when this line is executed:
positive_reviews = BeautifulSoup(open('*******/electronics/positive.review').read())
How can I fix this error?
I have also replaced
BeautifulSoup(open('*******/electronics/positive.review').read())
with
BeautifulSoup(open('*******/electronics/positive.review').read().decode('utf-8'))
but I am getting exactly the same error.
Finally, I have read several related posts on Stack Overflow, but nothing has worked for me so far. For example, running echo $LANG in my terminal outputs en_GB.UTF-8, as recommended in the first answer to "UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 1", but I still get the error above.
Upvotes: 0
Reputation: 1723
If you're using Python 3, try replacing
open('*******/electronics/positive.review')
with
open('*******/electronics/positive.review', encoding='utf-8')
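As a rough sketch of the same idea, the snippet below reads a file as UTF-8 text before handing it to any parser. It uses io.open, which accepts the encoding argument on both Python 2 and Python 3 (the built-in open in Python 2 does not). The helper name read_text_utf8 and the temporary demo file are made up for illustration, and the sketch assumes the .review files really are UTF-8 encoded, which I have not verified.

```python
import io
import os
import tempfile


def read_text_utf8(path):
    """Return the contents of `path` decoded as UTF-8 text.

    Hypothetical helper: it assumes the file is valid UTF-8 and raises
    UnicodeDecodeError otherwise.
    """
    # io.open takes encoding= on both Python 2 and Python 3.
    with io.open(path, encoding='utf-8') as handle:
        return handle.read()


# Demo on a throwaway file (not the real dataset). The fullwidth
# exclamation mark U+FF01 encodes to bytes starting with 0xef,
# the byte named in the error message.
sample = u'<review_text>caf\u00e9 works\uff01</review_text>'
fd, demo_path = tempfile.mkstemp(suffix='.review')
with os.fdopen(fd, 'wb') as handle:
    handle.write(sample.encode('utf-8'))

text = read_text_utf8(demo_path)
os.remove(demo_path)
```

The resulting text value is already decoded, so it can be passed to BeautifulSoup in place of open(...).read() without any further .decode('utf-8') call.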
Upvotes: 1