Reputation: 55
I am trying to check the status code of an arbitrary URL in Python using the following code:
import urllib2

class HeadRequest(urllib2.Request):
    # Make urlopen() send HEAD instead of the default GET.
    def get_method(self):
        return "HEAD"
When I use it like this:
response = urllib2.urlopen(HeadRequest("http://www.nativeseeds.org/"))
it throws the following exception:
HTTPError: HTTP Error 503: Service Temporarily Unavailable
However, when I open the same URL, "http://www.nativeseeds.org/", in Firefox/RESTClient, it returns a 200 status code.
Any help will be highly appreciated.
Upvotes: 1
Views: 1473
Reputation: 67073
After some investigation, it appears that the website requires both the Accept and User-Agent request headers to be set; otherwise, it returns a 503. This is terribly broken. It also appears to be doing user-agent sniffing: I get a 403 when using curl:
$ curl --head http://www.nativeseeds.org/
HTTP/1.1 403 Forbidden
Date: Wed, 26 Sep 2012 14:54:59 GMT
Server: Apache
P3P: CP="NOI ADM DEV PSAi COM NAV OUR OTRo STP IND DEM"
Set-Cookie: f65129b0cd2c5e10c387f919ac90ad66=PjZxNjvNmn6IlVh4Ac-tH0; path=/
Vary: Accept-Encoding
Content-Type: text/html
but it works fine if I set the user-agent to a browser string (here an old Internet Explorer one):
$ curl --user-agent "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)" --head http://www.nativeseeds.org/
HTTP/1.1 200 OK
Date: Wed, 26 Sep 2012 14:55:57 GMT
Server: Apache
P3P: CP="NOI ADM DEV PSAi COM NAV OUR OTRo STP IND DEM"
Expires: Mon, 1 Jan 2001 00:00:00 GMT
Cache-Control: post-check=0, pre-check=0
Pragma: no-cache
Set-Cookie: f65129b0cd2c5e10c387f919ac90ad66=ykOpGnEE%2CQOMUaVJLnM7W0; path=/
Last-Modified: Wed, 26 Sep 2012 14:56:27 GMT
Vary: Accept-Encoding
Content-Type: text/html; charset=utf-8
It appears to work using the requests module:
>>> import requests
>>> r = requests.head('http://www.nativeseeds.org/')
>>> import pprint; pprint.pprint(r.headers)
{'cache-control': 'post-check=0, pre-check=0',
'content-encoding': 'gzip',
'content-length': '20',
'content-type': 'text/html; charset=utf-8',
'date': 'Wed, 26 Sep 2012 14:58:05 GMT',
'expires': 'Mon, 1 Jan 2001 00:00:00 GMT',
'last-modified': 'Wed, 26 Sep 2012 14:58:09 GMT',
'p3p': 'CP="NOI ADM DEV PSAi COM NAV OUR OTRo STP IND DEM"',
'pragma': 'no-cache',
'server': 'Apache',
'set-cookie': 'f65129b0cd2c5e10c387f919ac90ad66=2NtRrDBra9jPsszChZXDm2; path=/',
'vary': 'Accept-Encoding'}
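As a rough sketch of how the header requirement described above could be applied to the question's urllib-based HEAD request, the following builds a request that carries both headers. It assumes Python 3, where urllib2 became urllib.request; the helper name build_head_request and the User-Agent value are illustrative choices, not something the site is known to accept.

```python
import urllib.request

# Hypothetical helper (not from the original posts): builds a HEAD request
# that carries the two headers the site appeared to insist on.
def build_head_request(url, user_agent="Mozilla/5.0 (compatible; status-check)"):
    headers = {
        "Accept": "*/*",           # accept any media type
        "User-Agent": user_agent,  # assumed value; the site may reject it
    }
    return urllib.request.Request(url, headers=headers, method="HEAD")

request = build_head_request("http://www.nativeseeds.org/")
print(request.get_method())
print(request.header_items())
# To actually send it: urllib.request.urlopen(request).status
```

Nothing is sent until the request is passed to urlopen, so the headers can be inspected first. Whether a given User-Agent string gets past the site's sniffing would have to be found by trial.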
Upvotes: 4
Reputation: 116
According to the urllib2 docs, get_method only returns 'GET' or 'POST'.
You may be interested in this.
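A quick check suggests that the GET/POST behaviour describes the default implementation, and that a subclass override like the one in the question does change what get_method reports. The sketch below assumes Python 3 (urllib.request in place of urllib2) and uses example.com as a placeholder URL; no network traffic is generated.

```python
import urllib.request

class HeadRequest(urllib.request.Request):
    # Same override as in the question, ported to Python 3.
    def get_method(self):
        return "HEAD"

url = "http://example.com/"  # placeholder URL, nothing is fetched

# Default implementation: GET without a body, POST with one.
default_method = urllib.request.Request(url).get_method()
post_method = urllib.request.Request(url, data=b"x").get_method()

# Two ways to get HEAD: the subclass override, or the method keyword
# that Request accepts in Python 3.3 and later.
overridden = HeadRequest(url).get_method()
built_in = urllib.request.Request(url, method="HEAD").get_method()

print(default_method, post_method, overridden, built_in)
```

As far as can be told from the standard library, urlopen asks the request object for its method when it sends the request, so either form should result in a HEAD request.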
Upvotes: 0
Reputation: 1122142
The problem you see has nothing to do with Python. The website itself seems to require something more than just a HEAD request. Even a simple telnet session results in the error:
$ telnet www.nativeseeds.org 80
Trying 208.113.230.85...
Connected to www.nativeseeds.org (208.113.230.85).
Escape character is '^]'.
HEAD / HTTP/1.1
Host: www.nativeseeds.org
HTTP/1.1 503 Service Temporarily Unavailable
Date: Wed, 26 Sep 2012 14:29:33 GMT
Server: Apache
Vary: Accept-Encoding
Connection: close
Content-Type: text/html; charset=iso-8859-1
Try adding some more headers; the http command-line client (HTTPie) does get a 200 response:
$ http -v head http://www.nativeseeds.org
HEAD / HTTP/1.1
Host: www.nativeseeds.org
Content-Type: application/x-www-form-urlencoded; charset=utf-8
Accept-Encoding: identity, deflate, compress, gzip
Accept: */*
User-Agent: HTTPie/0.2.2
HTTP/1.1 200 OK
Date: Wed, 26 Sep 2012 14:33:21 GMT
Server: Apache
P3P: CP="NOI ADM DEV PSAi COM NAV OUR OTRo STP IND DEM"
Expires: Mon, 1 Jan 2001 00:00:00 GMT
Cache-Control: post-check=0, pre-check=0
Pragma: no-cache
Set-Cookie: f65129b0cd2c5e10c387f919ac90ad66=34hOijDSzeskKYtULx9V83; path=/
Last-Modified: Wed, 26 Sep 2012 14:33:23 GMT
Vary: Accept-Encoding
Content-Encoding: gzip
Content-Length: 20
Content-Type: text/html; charset=utf-8
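Because urlopen raises HTTPError for responses such as the 503 above, a status check might treat that exception as a result rather than a failure. The following is a minimal sketch assuming Python 3; head_status, its opener parameter, and the header values are hypothetical, and the opener is injectable only so the logic can be exercised without contacting the site.

```python
import urllib.error
import urllib.request

def head_status(url, opener=urllib.request.urlopen):
    """Return the HTTP status code for a HEAD request to url."""
    request = urllib.request.Request(
        url,
        headers={"Accept": "*/*", "User-Agent": "status-check/0.1"},  # assumed values
        method="HEAD",
    )
    try:
        response = opener(request)
    except urllib.error.HTTPError as error:
        # Error responses (4xx/5xx) arrive as exceptions; the code is still available.
        return error.code
    try:
        return response.status
    finally:
        response.close()
```

Calling head_status("http://www.nativeseeds.org/") would return whatever status the server reports at that moment, whether that is 200, 403, or 503.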
Upvotes: 3