Reputation: 14978
I am trying to access a page and its HTML looks like:
?2?pɢ???=???I????܉??s???? [??AX#?`s??5???2`?| ,q?ɲ?=h?}VTŬ~?Y?}u3cx?pȢ?K_Ol&ɡ??'N??Y??n5?890??G???&$?%J#?ܩ?ѡ
1?y???
$] &'ι?\?~T?=??@N?C?$??K? ??iu"T?M
?6>?&5?:??sJ???xi???V??N??????3R7u??ǹ??7qs??<*????????@3?
EWu}??'F??Z??߶O?????Fc۰?S???h??/????h???[kS( f?\˹?@e???7_~~??*'?Jq??i?͛?J?W?T?Y]S??ӫ?~??kH??
w?L??ws??M?h?V?؊<[ ?
??A?G?w?
What's that? Is it some encoding/decoding thing? How can I view the HTML?
The code is here:
import requests
from bs4 import BeautifulSoup
import json
headers_initial = {
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9',
    'cache-control': 'no-cache',
    'upgrade-insecure-requests': '1',
}
r = requests.get('https://www.example.com/', headers=headers_initial)
if r.status_code == 200:
    html = r.text.strip()
    print(html)
Upvotes: 0
Views: 221
Reputation: 9881
The problem comes from your headers. Just remove the accept-encoding header and it should work fine.
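A minimal sketch of that fix, assuming the same request as in the question. The helper names build_headers and fetch_html are hypothetical, not part of requests; the point is only that the accept-encoding entry is dropped before the request is sent, so requests falls back to its own default.

```python
def build_headers(base_headers):
    """Return a copy of the headers without any Accept-Encoding entry."""
    # Header names are case-insensitive, so compare in lower case.
    return {
        name: value
        for name, value in base_headers.items()
        if name.lower() != "accept-encoding"
    }


def fetch_html(url, base_headers):
    """Fetch a page with the cleaned headers and return its decoded text."""
    # Imported here so build_headers stays usable without requests installed.
    import requests

    response = requests.get(url, headers=build_headers(base_headers), timeout=10)
    response.raise_for_status()  # raise on 4xx/5xx instead of checking 200 by hand
    return response.text
```

With the question's dictionary this would be called as fetch_html('https://www.example.com/', headers_initial).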
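Edit: the accept-encoding header tells the server which compressed formats the client can handle. requests decodes gzip and deflate transparently, but it cannot decode Brotli (br) unless the brotli or brotlicffi package is installed, so a br response comes back as raw compressed bytes. If you need to specify the header, use the identity value, meaning "just send me the page without compression".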
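To illustrate why the output looks like noise, here is a small standard-library sketch. It uses gzip as a stand-in for whatever compression the server applied (that substitution is an assumption made for the demo), and the sample HTML string and the headers_identity dictionary are made up for the example.

```python
import gzip

html = "<html><body><h1>Hello</h1></body></html>"

# What a server could send after compressing the page (gzip stands in here).
compressed = gzip.compress(html.encode("utf-8"))

# Reading the compressed bytes as if they were text gives unreadable characters.
garbled = compressed.decode("utf-8", errors="replace")

# Decompressing first restores the original markup.
restored = gzip.decompress(compressed).decode("utf-8")

# Hypothetical header set that asks the server not to compress the body at all.
headers_identity = {"accept-encoding": "identity"}

print(repr(garbled[:20]))
print(restored)
```

The first print shows replacement characters much like the ones in the question, while the second shows the page as expected.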
Upvotes: 2