Reputation: 4562
I'm working on a Python 3 function to check various websites and make sure they are OK (200 response, correct metadata, page size, etc.). These sites use different encodings. I'm using pycurl to fetch the page bodies, and according to the pycurl quickstart the page's encoding (e.g. utf-8) needs to be known before the body can be decoded.
How do I get the current encoding of a site before passing it for decoding? And is pycurl my best bet in Python 3 for comparing page content?
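Here is roughly what I have so far, adapted from the quickstart (a simplified sketch - example.com stands in for the real site list):
import pycurl
from io import BytesIO

buffer = BytesIO()
c = pycurl.Curl()
c.setopt(c.URL, "http://example.com/")  # placeholder URL
c.setopt(c.WRITEDATA, buffer)
c.perform()
status = c.getinfo(c.RESPONSE_CODE)  # check for 200 etc.
c.close()

body = buffer.getvalue()  # raw bytes - which codec do I pass here?
text = body.decode("utf-8")  # only correct if the site really is utf-8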
Upvotes: 0
Views: 468
Reputation: 25799
You usually determine the encoding from the HTTP headers returned by the server (the charset parameter of the Content-Type header). Instead of determining that yourself, use the requests module, which does all of that for you, so getting the decoded content is as simple as:
import requests

req = requests.get("your_url")
if req.status_code == 200:
    print(req.text)  # print out the decoded content or do whatever you want with it
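If you want to inspect what requests decided, a quick sketch using two attributes from the requests API:
import requests

req = requests.get("your_url")
print(req.encoding)           # encoding taken from the Content-Type header, if any
print(req.apparent_encoding)  # encoding guessed from the raw body by charset detection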
If the encoding is not present in the headers then it gets a bit more complicated - you'll have to treat the response as ASCII-encoded HTML, try to find a <meta http-equiv="Content-Type" ... /> tag, and extract the encoding from its content attribute. Once you have it, you'll have to decode the content again with the encoding in question.
In the requests response, the non-decoded content is available as req.content. To get the ASCII-encoded HTML, use req.content.decode("ascii", errors="replace") (the replacement handles any non-ASCII bytes in this first pass), then parse the HTML and look for the codec (search SO on how to parse HTML in Python). Finally, once you have the codec, just re-decode the content with it - req.content.decode(your_discovered_codec) - to get the properly decoded content.
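A minimal sketch of that whole fallback using only the standard library's html.parser (the utf-8 default at the end is my assumption - pick whatever fallback suits you):
import requests
from html.parser import HTMLParser

class CharsetFinder(HTMLParser):
    # Collects the first charset declaration found in a <meta> tag.
    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset:
            return
        attrs = dict(attrs)
        if "charset" in attrs:
            # HTML5 style: <meta charset="utf-8">
            self.charset = attrs["charset"]
        elif (attrs.get("http-equiv") or "").lower() == "content-type":
            # old style: <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
            content = attrs.get("content") or ""
            if "charset=" in content:
                self.charset = content.split("charset=")[-1].strip()

req = requests.get("your_url")
finder = CharsetFinder()
finder.feed(req.content.decode("ascii", errors="replace"))  # lossy first pass, fine for finding the tag
codec = finder.charset or "utf-8"  # assumed fallback when no meta tag is found
text = req.content.decode(codec)   # the properly decoded content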
Upvotes: 1