Umair Ayub

Reputation: 21261

'charmap' codec can't encode character '\xae' While Scraping a Webpage

I am web scraping with Python using BeautifulSoup, and I am getting this error:

'charmap' codec can't encode character '\xae' in position 69: character maps to <undefined>

when scraping a webpage

This is my Python code:

hotel = BeautifulSoup(state.)
print (hotel.select("div.details.cf span.hotel-name a"))
# Tried:  print (hotel.select("div.details.cf span.hotel-name a")).encode('utf-8')

Upvotes: 0

Views: 1388

Answers (1)

Irshad Bhat

Reputation: 8709

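This problem usually shows up when you call .encode() on a byte string that is already encoded. In Python 2, .encode() on a byte string first decodes it implicitly as ASCII, which fails for any non-ASCII byte. So try decoding it explicitly first, as in: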

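import urllib  # Python 2; in Python 3 the equivalent is urllib.request.urlopen

html = urllib.urlopen(link).read()            # raw byte string from the server
unicode_str = html.decode(<source encoding>)  # placeholder: the page's real encoding
encoded_str = unicode_str.encode("utf8")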

As an example:

html = '\xae'
encoded_str = html.encode("utf8")

Fails with

UnicodeDecodeError: 'ascii' codec can't decode byte 0xae in position 0: ordinal not in range(128)

While:

html = '\xae'
decoded_str = html.decode("windows-1252")
encoded_str = decoded_str.encode("utf8")
print encoded_str
®

This succeeds without error. Note that "windows-1252" is only an example: I got it from chardet, which reported just 0.5 confidence that it is right (unsurprising for a one-character string). Replace it with the actual encoding of the byte string returned by .urlopen().read() for the content you retrieved.
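For anyone on Python 3, where byte strings and text are separate types, the same decode-then-encode idea might look roughly like the sketch below. The helper name to_utf8 is hypothetical (it is not part of any library), the literal b"\xae" stands in for whatever .read() returned, and "windows-1252" is again just an assumed source encoding.

```python
# Python 3 sketch: bytes must be decoded to str before being re-encoded.

def to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Decode `data` from `source_encoding`, then re-encode it as UTF-8.

    Hypothetical helper for illustration only.
    """
    text = data.decode(source_encoding)  # bytes -> str
    return text.encode("utf-8")          # str -> UTF-8 bytes


raw = b"\xae"  # stand-in for the bytes returned by .read()
utf8_bytes = to_utf8(raw, "windows-1252")  # assumed source encoding
print(repr(utf8_bytes))
```

If the assumed source encoding is wrong, .decode() may raise UnicodeDecodeError or silently produce the wrong characters, so it is worth checking the page's declared charset first.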

Upvotes: 1
