Reputation: 4562
I'm working on a Python 3 function to check various websites and make sure they are OK (200 response, correct metadata, page size, etc.). These sites use different encodings. I'm using pycurl to fetch the page bodies, and according to the pycurl quickstart the page's encoding (e.g. utf-8) needs to be known before the body can be decoded.
How do I get the current encoding of a site before passing it for decoding? And is pycurl my best bet in Python 3 for comparing page content?
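Here is roughly what I have so far, adapted from the quickstart (a simplified sketch - example.com stands in for the real site list):
import pycurl
from io import BytesIO

buffer = BytesIO()
c = pycurl.Curl()
c.setopt(c.URL, "http://example.com/")  # placeholder URL
c.setopt(c.WRITEDATA, buffer)
c.perform()
status = c.getinfo(c.RESPONSE_CODE)  # check for 200 etc.
c.close()

body = buffer.getvalue()  # raw bytes - which codec do I pass here?
text = body.decode("utf-8")  # only correct if the site really is utf-8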
Upvotes: 0
Views: 468
Reputation: 25799
You usually determine the encoding from the HTTP headers returned by the server (the charset parameter of the Content-Type header). Instead of determining that yourself, use the requests module, which does all of that for you, so getting the decoded content is as simple as:
import requests

req = requests.get("your_url")
if req.status_code == 200:
    print(req.text)  # print out the decoded content or do whatever you want with it
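If you want to inspect what requests decided, a quick sketch using two attributes from the requests API:
import requests

req = requests.get("your_url")
print(req.encoding)           # encoding taken from the Content-Type header, if any
print(req.apparent_encoding)  # encoding guessed from the raw body by charset detection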
If the encoding is not present in the headers then it gets a bit more complicated - you'll have to treat the response as ASCII-encoded HTML, try to find a <meta http-equiv="Content-Type" ... /> tag, and extract the encoding from its content attribute. Once you have it, you'll have to decode the content again with the encoding in question.
In the requests response, the non-decoded content is available as req.content. To get the ASCII-encoded HTML, use req.content.decode("ascii", errors="replace") (the replacement handles any non-ASCII bytes in this first pass), then parse the HTML and look for the codec (search SO on how to parse HTML in Python). Finally, once you have the codec, just re-decode the content with it - req.content.decode(your_discovered_codec) - to get the properly decoded content.
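A minimal sketch of that whole fallback using only the standard library's html.parser (the utf-8 default at the end is my assumption - pick whatever fallback suits you):
import requests
from html.parser import HTMLParser

class CharsetFinder(HTMLParser):
    # Collects the first charset declaration found in a <meta> tag.
    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset:
            return
        attrs = dict(attrs)
        if "charset" in attrs:
            # HTML5 style: <meta charset="utf-8">
            self.charset = attrs["charset"]
        elif (attrs.get("http-equiv") or "").lower() == "content-type":
            # old style: <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
            content = attrs.get("content") or ""
            if "charset=" in content:
                self.charset = content.split("charset=")[-1].strip()

req = requests.get("your_url")
finder = CharsetFinder()
finder.feed(req.content.decode("ascii", errors="replace"))  # lossy first pass, fine for finding the tag
codec = finder.charset or "utf-8"  # assumed fallback when no meta tag is found
text = req.content.decode(codec)   # the properly decoded content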
Upvotes: 1