Reputation: 2677
I'm having trouble downloading https pages with the urllib2 module, which seems to result from urllib2's inability to access the system's certificate store.
To get around this issue, one possible solution is to download HTTPS pages with pycurl, using the CA bundle shipped with the certifi module. The following is an example of doing so:
def download_web_page_with_curl(url_website):
    from pycurl import Curl, CAINFO, URL
    from certifi import where
    from cStringIO import StringIO

    response = StringIO()
    curl = Curl()
    # Point pycurl at certifi's CA bundle so the server certificate can be verified.
    curl.setopt(CAINFO, where())
    curl.setopt(URL, url_website)
    # Collect the response body in the StringIO buffer.
    curl.setopt(curl.WRITEFUNCTION, response.write)
    curl.perform()
    curl.close()
    return response.getvalue()
Is there a way to use certifi with urllib2 (in a fashion comparable to the pycurl example above), which will permit me to download https sites? Alternatively, is there another feasible urllib2-based workaround which will remedy the permissions issue, without compromising security?
Upvotes: 4
Views: 1419
Reputation: 976
Would recommend using requests per my other answer. However, to answer the original question of how to do this with urllib2:
import urllib2
import certifi

def download_web_page_with_urllib2(url_website):
    # The cafile argument (available since Python 2.7.9) makes urlopen verify
    # the server's certificate against certifi's CA bundle.
    t = urllib2.urlopen(url_website, cafile=certifi.where())
    return t.read()

text = download_web_page_with_urllib2('https://www.google.com/')
The same recommendations about error checking apply.
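For reference, here is a minimal sketch of that error checking around the urllib2 version (the wrapper name is just for illustration): urlopen raises HTTPError for non-2xx responses and URLError for network problems, bad URLs, and failed certificate validation.

import urllib2
import certifi

def download_web_page_with_urllib2_checked(url_website):
    try:
        t = urllib2.urlopen(url_website, cafile=certifi.where())
        return t.read()
    except urllib2.HTTPError as e:
        # The server answered, but with a non-OK status code (e.g. 404).
        raise RuntimeError('Server returned status %d for %s' % (e.code, url_website))
    except urllib2.URLError as e:
        # Covers unreachable hosts, malformed URLs, and certificate failures.
        raise RuntimeError('Could not fetch %s: %s' % (url_website, e.reason))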
Upvotes: 4
Reputation: 976
Expanding on the comment to use requests (which is built on urllib3):
def download_web_page_with_requests(url_website):
    import requests
    # requests verifies HTTPS certificates by default, using certifi's
    # CA bundle when it is available.
    r = requests.get(url_website)
    return r.text
This is so much easier than anything else and properly handles SSL verification independently of the platform's own cert lists. If certifi is found, requests will use it automatically. If not, it silently falls back to a more limited, possibly older set of bundled root certs. If ensuring that certifi is used matters to you, you can do this:
r = requests.get(url_website, verify=certifi.where())
Note that the above code does not do the error checking that you should probably do. requests.get() can throw a number of exceptions for invalid URLs, unreachable sites, communication errors, and failed certificate validation, so you should be prepared to catch and deal with those. If it does successfully talk to a server, but the server returns a non-OK status code (such as for a non-existent page), then an exception won't be thrown, so you'd also want to check that r.status_code == 200.
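A minimal sketch of what that checking might look like (the wrapper name and the timeout value are just illustrative choices):

import requests

def download_web_page_checked(url_website):
    try:
        # Every requests failure mode (bad URL, unreachable host, timeout,
        # failed certificate validation) raises a subclass of RequestException.
        r = requests.get(url_website, timeout=10)
    except requests.exceptions.RequestException as e:
        raise RuntimeError('Could not fetch %s: %s' % (url_website, e))
    # A non-OK status (e.g. 404) does not raise an exception, so check it here.
    if r.status_code != 200:
        raise RuntimeError('Unexpected status %d for %s' % (r.status_code, url_website))
    return r.text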
Upvotes: 2