Reputation: 6238
I found this code and it seemed reliable and efficient to me, but unfortunately it is for Python 2, and it uses urllib2 while everybody says requests is faster. What would be the equivalent of the following code (or something more efficient or more reliable) in Python 3?
#!/usr/bin/env python
#-*- coding:utf-8 -*-
import sys
import urllib2
# This script uses HEAD requests (with fallback in case of 405)
# to follow the redirect path up to the real URL
# (c) 2012 Filippo Valsorda - FiloSottile
# Released under the GPL license
class HeadRequest(urllib2.Request):
    def get_method(self):
        return "HEAD"
class HEADRedirectHandler(urllib2.HTTPRedirectHandler):
    """
    Subclass the HTTPRedirectHandler to make it use our
    HeadRequest also on the redirected URL
    """
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        if code in (301, 302, 303, 307):
            newurl = newurl.replace(' ', '%20')
            newheaders = dict((k,v) for k,v in req.headers.items()
                              if k.lower() not in ("content-length", "content-type"))
            return HeadRequest(newurl,
                               headers=newheaders,
                               origin_req_host=req.get_origin_req_host(),
                               unverifiable=True)
        else:
            raise urllib2.HTTPError(req.get_full_url(), code, msg, headers, fp)
class HTTPMethodFallback(urllib2.BaseHandler):
    """
    Fallback to GET if HEAD is not allowed (405 HTTP error)
    """
    def http_error_405(self, req, fp, code, msg, headers):
        fp.read()
        fp.close()
        newheaders = dict((k,v) for k,v in req.headers.items()
                          if k.lower() not in ("content-length", "content-type"))
        return self.parent.open(urllib2.Request(req.get_full_url(),
                                                headers=newheaders,
                                                origin_req_host=req.get_origin_req_host(),
                                                unverifiable=True))
# Build our opener
opener = urllib2.OpenerDirector()
for handler in [urllib2.HTTPHandler, urllib2.HTTPDefaultErrorHandler,
                HTTPMethodFallback, HEADRedirectHandler,
                urllib2.HTTPErrorProcessor, urllib2.HTTPSHandler]:
    opener.add_handler(handler())
response = opener.open(HeadRequest(sys.argv[1]))
print(response.geturl())
By the way, a HEAD request is not actually what I need. I only want to know whether the link is broken. (On some sites, if you give them a broken link they redirect you back to the main page of the site, and I want my code to recognize this too.) A HEAD request is the most efficient solution that came to mind, so if you know a better way I'd appreciate that as well.
Upvotes: 0
Views: 2793
Reputation: 665
Take a look at Requests: http://docs.python-requests.org/en/master/
To do a HEAD request, you simply go:
import requests
r = requests.head('http://www.example.com')
Then you can access the object for what you need. For example, the status code:
print(r.status_code)
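Note that `requests.head` does not follow redirects unless you pass `allow_redirects=True`, so resolving the final URL the way the question's script does needs that flag. Here is a minimal sketch of that idea with a GET fallback when the server answers HEAD with 405. The helper name `resolve_url` and the 10-second timeout are my own assumptions for illustration, not part of the original script:

```python
import requests


def resolve_url(url, session=None, timeout=10):
    """Follow redirects with HEAD and return (final_url, status_code).

    Falls back to GET if the server rejects HEAD with 405.
    The function name and timeout value are illustrative assumptions.
    """
    if session is None:
        session = requests.Session()
    # HEAD requests in Requests do not follow redirects unless asked to.
    response = session.head(url, allow_redirects=True, timeout=timeout)
    if response.status_code == 405:
        # stream=True defers the body download; only the headers are needed.
        response = session.get(url, allow_redirects=True, stream=True,
                               timeout=timeout)
        response.close()
    return response.url, response.status_code
```

Calling `resolve_url(sys.argv[1])` and printing the first element of the result should roughly mirror the last two lines of the original script.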
Update:
If you want to check whether a page is live, you'll want to do a GET request. I've seen cases of a HEAD request returning a 200 response while a GET request to the same URL returned a 500.
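For the case in the question where a site sends a dead link back to its main page, one possible heuristic is to compare the URL you requested with the URL you ended up on. The sketch below is only that, a heuristic: the names `redirected_to_root` and `is_broken` are hypothetical, and a site that redirects dead links somewhere other than `/` would slip through. The final URL and status code could come from, for example, `r.url` and `r.status_code` after `r = requests.get(url)`, which follows redirects by default.

```python
from urllib.parse import urlsplit


def redirected_to_root(requested_url, final_url):
    """Heuristic: True if a non-root URL ended up on a site's front page."""
    requested = urlsplit(requested_url)
    final = urlsplit(final_url)
    requested_is_root = requested.path in ("", "/") and not requested.query
    final_is_root = final.path in ("", "/") and not final.query
    return final_is_root and not requested_is_root


def is_broken(requested_url, final_url, status_code):
    """Treat HTTP error statuses and redirects to the front page as broken."""
    return status_code >= 400 or redirected_to_root(requested_url, final_url)
```

A link that was already pointing at the front page is deliberately not flagged, since landing on `/` is then the expected result.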
Upvotes: 1