Peter

Reputation: 427

Check if an HTTPS webpage exists in Python

In a Python 2.x script, I am looking for functionality to check whether an HTTPS page returns particular content (I will likely need to parse the page content to discover this). The page also sits behind an htpasswd prompt, so a username and password are needed to see the content. So I suppose I am looking for a module or other functionality that lets me hardcode a username and password, fetch the page, and inspect the output (i.e. check whether keywords that indicate a 404 page are present).

I was having a look at http://docs.python.org/2/library/httplib.html but it doesn't seem to do what I am looking for.

Upvotes: 0

Views: 671

Answers (1)

Martijn Pieters

Reputation: 1121486

You could do it with the httplib module but there are easier methods that don't require manual driving of the HTTP protocol.

Using the requests library (an external module requiring installation first) is probably simplest:

import requests

auth = ('someusername', 'somepassword')
response = requests.get(yoururl, auth=auth)
response.raise_for_status()

This will raise a requests.exceptions.HTTPError if the server answered with any 4xx or 5xx status, which includes a 404 Not Found.

You can then further parse the response body with response.content (a byte string) or response.text (the body decoded to unicode).
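Since the question mentions looking for keywords that indicate a 404 page, here is a minimal sketch of such a check. The marker phrases and the helper names looks_missing and page_exists are hypothetical, not part of requests; the phrases would need to match whatever the site's error page actually says.

```python
# Hypothetical marker phrases; adjust to the wording of the site's own error page.
NOT_FOUND_MARKERS = ('page not found', '404 not found')


def looks_missing(status_code, text, markers=NOT_FOUND_MARKERS):
    """Return True if the status or the body text suggests the page is missing."""
    # Any 4xx/5xx status is treated as "not there" for the purposes of this check.
    if status_code >= 400:
        return True
    lowered = text.lower()
    return any(marker in lowered for marker in markers)


def page_exists(url, username, password):
    """Fetch url with HTTP Basic auth and report whether it looks like a real page."""
    import requests  # imported lazily so looks_missing() works without requests installed

    response = requests.get(url, auth=(username, password))
    return not looks_missing(response.status_code, response.text)
```

Keeping the text check separate from the network call means the keyword logic can be tried against saved page content before pointing it at the real server.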

Using just the standard library, the urllib2 version would look like:

import urllib2, base64

request = urllib2.Request(yoururl)
# b64encode never inserts line breaks, so long credentials stay on one header line
authstring = base64.b64encode('{}:{}'.format('someusername', 'somepassword'))
request.add_header("Authorization", "Basic {}".format(authstring))
response = urllib2.urlopen(request)

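if not 200 <= response.getcode() < 400:
    # urlopen() already raises urllib2.HTTPError for 4xx/5xx, so this is only a safeguard
    raise RuntimeError('unexpected status {}'.format(response.getcode()))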

content = response.read()
try:
    # getparam() takes only the parameter name and returns None when it is absent
    text = content.decode(response.info().getparam('charset') or 'utf8')
except UnicodeDecodeError:
    text = content.decode('ascii', 'replace')

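where content is the byte string contents of the response body, and text is the unicode value. If the body can't be decoded with the declared charset, text falls back to a lossy ASCII decode in which undecodable bytes are replaced.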
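The two fiddly steps in the standard library version, building the Authorization header and decoding the body, can be pulled into small helpers. This is a rough sketch written to run on both Python 2.7 and Python 3; the names basic_auth_header and decode_body are made up for illustration, and the UTF-8 default is an assumption.

```python
import base64


def basic_auth_header(username, password):
    """Build the value of an HTTP Basic Authorization header."""
    credentials = u'{}:{}'.format(username, password).encode('utf-8')
    # b64encode never inserts line breaks, unlike the older encodestring helper.
    return 'Basic ' + base64.b64encode(credentials).decode('ascii')


def decode_body(content, charset=None, default='utf-8'):
    """Decode a byte string, falling back to a lossy ASCII decode on failure."""
    try:
        return content.decode(charset or default)
    except (UnicodeDecodeError, LookupError):
        # LookupError covers a charset name that Python does not recognise.
        return content.decode('ascii', 'replace')
```

With these, the request setup becomes request.add_header("Authorization", basic_auth_header('someusername', 'somepassword')), and the try/except around the decode collapses into a single decode_body(content, charset) call.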

Upvotes: 2
