Boa

Reputation: 2677

Using certifi module with urllib2?

I'm having trouble downloading HTTPS pages with the urllib2 module, which seems to result from urllib2's inability to access the system's certificate store.

To get around this issue, one possible solution is to download https web pages with pycurl, by using the certifi module. The following is an example of doing so:

def download_web_page_with_curl(url_website):
    from pycurl import Curl, CAINFO, URL
    from certifi import where
    from cStringIO import StringIO

    response = StringIO()
    curl = Curl()
    curl.setopt(CAINFO, where())
    curl.setopt(URL, url_website)
    curl.setopt(curl.WRITEFUNCTION, response.write)
    curl.perform()
    curl.close()
    return response.getvalue()

Is there a way to use certifi with urllib2 (in a fashion comparable to the pycurl example above) that will let me download HTTPS pages? Alternatively, is there another feasible urllib2-based workaround that remedies the certificate issue without compromising security?

Upvotes: 4

Views: 1419

Answers (2)

wojtow

Reputation: 976

I would recommend using requests, per my other answer. However, to answer the original question of how to do this with urllib2 (the cafile argument requires Python 2.7.9 or later):

import urllib2
import certifi


def download_web_page_with_urllib2(url_website):
    # Verify the server certificate against certifi's CA bundle.
    t = urllib2.urlopen(url_website, cafile=certifi.where())
    return t.read()


text = download_web_page_with_urllib2('https://www.google.com/')

The same recommendations about error checking apply.
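For interpreters where urlopen no longer accepts cafile (the argument was deprecated in Python 3.6 and removed in 3.13), the same idea can be expressed by building an SSLContext from the CA bundle and passing it as context, which urlopen accepts from Python 2.7.9 and 3.4.3 onward. The following is only a sketch: the names make_verified_context, download_web_page_with_context, and ca_bundle_path are made up for illustration, and certifi is assumed to be installed when no bundle path is given.

```python
import ssl

try:
    from urllib2 import urlopen          # Python 2.7.9+
except ImportError:
    from urllib.request import urlopen   # Python 3


def make_verified_context(ca_bundle_path=None):
    """Return an SSLContext that verifies server certificates.

    ca_bundle_path is a hypothetical parameter name: the path to a PEM
    file of trusted root certificates. With None, the platform's
    default certificate store is loaded instead.
    """
    return ssl.create_default_context(cafile=ca_bundle_path)


def download_web_page_with_context(url_website, ca_bundle_path=None):
    """Download url_website, verifying it against the given CA bundle."""
    if ca_bundle_path is None:
        # Assumption: certifi is installed. It is imported lazily so
        # that make_verified_context() stays usable without it.
        import certifi
        ca_bundle_path = certifi.where()
    context = make_verified_context(ca_bundle_path)
    response = urlopen(url_website, context=context)
    try:
        return response.read()
    finally:
        response.close()
```

Calling download_web_page_with_context('https://www.google.com/') should then behave like the cafile version above, with certificate and hostname verification left switched on.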

Upvotes: 4

wojtow

Reputation: 976

Expanding on the comment to use requests (which is built on urllib3):

def download_web_page_with_requests(url_website):
    import requests

    r = requests.get(url_website)
    return r.text

This is so much easier than anything else and properly handles SSL verification independent of the platform's own cert lists. If certifi is found, requests will automatically use it. If not, it silently falls back to a more limited, possibly older set of built-in root certs. If ensuring that certifi is used matters to you, you can do this:

import certifi

r = requests.get(url_website, verify=certifi.where())

Note that the above code does not do the error checking that you should probably do. requests.get() can raise a number of exceptions for invalid URLs, unreachable sites, communication errors, and failed certificate validation, so you should be prepared to catch and deal with those. If it does successfully talk to a server, but the server returns a non-OK status code (such as for a non-existent page), then no exception is raised, so you'd also want to check that r.status_code == 200.
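One way to fold the status check into the helper is to call raise_for_status() on the response, which raises requests.exceptions.HTTPError for 4xx and 5xx responses, so every failure surfaces as an exception. This is a rough sketch rather than the one right way to do it: the function name download_web_page_checked, the get parameter (there only so the transport can be swapped out, for example in tests), and the 10-second timeout are assumptions for illustration.

```python
def download_web_page_checked(url_website, get=None, timeout=10):
    """Return the page text, raising an exception on any failure.

    get is a hypothetical hook: any callable with the same calling
    convention as requests.get. When omitted, requests.get is used.
    """
    if get is None:
        import requests  # third-party; assumed to be installed
        get = requests.get
    r = get(url_website, timeout=timeout)
    # Turn non-OK status codes (404, 500, ...) into an exception
    # instead of silently returning the error page's body.
    r.raise_for_status()
    return r.text
```

A caller can then wrap the call in a single try/except requests.exceptions.RequestException block, since that class is the base of the exceptions requests raises, including the HTTPError from raise_for_status().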

Upvotes: 2
