Clovis

Reputation: 193

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER

I've been coding in Python for about 3 months now and I am trying to do some web scraping for a class project, from the site "https://www.countyhealthrankings.org/app/alabama/2019/rankings/outcomes/". However, when I try to pull the site's HTML I get the error, "Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER." I've experimented with a few different types of decoding, but no luck. The code that I am using is below. Any help would be greatly appreciated.

Code:

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

myurl = "https://www.countyhealthrankings.org/app/alabama/2019/rankings/outcomes/"
website = uReq(myurl)
website_html = website.read()  #.decode(encoding="iso-8859-1")
print(website_html)
print(type(website_html))
website.close()
site_soup = soup(website_html, "html.parser")  #, from_encoding="utf-8")
print(type(site_soup))

The HTML output comes out like this:

�}�WI���s����y��-H,��`�m���6����� 5�J�Řf����ȭJ%!��r���cTU�DFFDƖ�����?y�� ����۾W,��9����� ��y7����5ju���N���{�

The byte version of the data looks like this:

\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03\xec}\xf9W\x1bI\xb2\xee\xefs\xce\xfb\x1f\xca\xeay\x83\xfd\x0

Upvotes: 3

Views: 3452

Answers (2)

Token Joe

Reputation: 177

I was not able to reproduce your error, but here is a working version that uses the requests package:

import requests
from bs4 import BeautifulSoup as bs

url = "https://www.countyhealthrankings.org/app/alabama/2019/rankings/outcomes/"

r = requests.get(url)

soup = bs(r.content, "html.parser")  # name a parser explicitly to avoid bs4's "no parser specified" warning

print(soup)

Upvotes: 0

snakecharmerb

Reputation: 55589

The site serves its responses with gzip content encoding (Content-Encoding: gzip), and your HTTP client does not decompress them.

>>> from urllib import request
>>> import gzip
>>> url = "https://www.countyhealthrankings.org/app/alabama/2019/rankings/outcomes/"
>>> res = request.urlopen(url)
>>> content = res.read()
>>> content[:25]
b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03\xec}\xf9W\x1bI\xb2\xee\xefs\xce\xfb\x1f\xca\xea'

>>> print(res.headers)
...
Content-Encoding: gzip
Content-Language: en
Content-Type: text/html; charset=utf-8
...
>>> decompressed = gzip.decompress(content)
>>> decompressed[:25]
b'<!DOCTYPE html>\n<!--[if l'
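To feed the decompressed bytes back into BeautifulSoup the way your original code does, a minimal sketch (reusing the question's URL and html.parser, and checking the Content-Encoding header before decompressing) could look like this:

import gzip
from urllib.request import urlopen
from bs4 import BeautifulSoup

myurl = "https://www.countyhealthrankings.org/app/alabama/2019/rankings/outcomes/"

res = urlopen(myurl)
raw = res.read()
# Only decompress when the server says the body is gzip-compressed
if res.headers.get("Content-Encoding") == "gzip":
    raw = gzip.decompress(raw)
res.close()

site_soup = BeautifulSoup(raw, "html.parser")
print(site_soup.title)

Checking the header first keeps the code working if the server ever stops compressing the response.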

The requests package handles the decompression automatically:

>>> import requests
>>> r = requests.get(url)
>>> r.text[:25]
'<!DOCTYPE html>\n<!--[if l'

Upvotes: 1
