Reputation: 8650
If I use urllib to load this URL (https://www.fundingcircle.com/my-account/sell-my-loans/), I get a 400 status error.
For example, the following returns a 400 error:
>>> import urllib
>>> f = urllib.urlopen("https://www.fundingcircle.com/my-account/sell-my-loans/")
>>> print f.read()
However, if I copy and paste the url into my browser, I see a web page with the information that I want to see.
I have tried wrapping the call in try/except and reading the error, but the returned data just tells me that the page does not exist. For example:
import urllib
try:
    f = urllib.urlopen("https://www.fundingcircle.com/my-account/sell-my-loans/")
except Exception as e:
    eString = e.read()
    print eString
Why can't Python load the page?
Upvotes: 1
Views: 170
Reputation: 1123410
If Python is given a 404 status then that'd be because the server refuses to give you the page.
Why that is can be hard to tell, because servers are black boxes. But your browser gives the server more than just the URL; it also sends a set of HTTP headers. Most likely the server alters its behaviour based on the contents of one or more of those headers.
You need to look in your browser development tools and see what your browser sends, then try and replicate some of those headers from Python. Obvious candidates are the User-Agent header, followed by Accept and Cookie headers.
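As a rough sketch of what replicating headers could look like: the snippet below is written for Python 3, where the request-building machinery lives in urllib.request (urllib2 in Python 2). The header values are placeholders I made up, not what your browser sends, so copy the real ones from your development tools. build_request is just a helper name for this illustration.

```python
from urllib.request import Request, urlopen

URL = "https://www.fundingcircle.com/my-account/sell-my-loans/"

# Placeholder values (assumptions): copy the real ones from your
# browser's development tools, under the request headers.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
    "Accept": "text/html,application/xhtml+xml;q=0.9,*/*;q=0.8",
}


def build_request(url, cookie=None):
    """Return a Request carrying browser-like headers. Nothing is sent yet."""
    headers = dict(BROWSER_HEADERS)
    if cookie is not None:
        # Only add a Cookie header when the caller supplies one.
        headers["Cookie"] = cookie
    return Request(url, headers=headers)


request = build_request(URL)
# To actually send it: body = urlopen(request).read()
```

Building the request separately from sending it makes it easy to inspect the headers before anything goes over the network.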
However, in this specific case, the server is responding with a 401 Unauthorized; you are given a login page. It does this both for the browser and Python:
>>> import urllib
>>> urllib.urlopen('https://www.fundingcircle.com/my-account/sell-my-loans/')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/urllib.py", line 87, in urlopen
    return opener.open(url)
  File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/urllib.py", line 208, in open
    return getattr(self, name)(url)
  File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/urllib.py", line 451, in open_https
    return self.http_error(url, fp, errcode, errmsg, headers)
  File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/urllib.py", line 372, in http_error
    result = method(url, fp, errcode, errmsg, headers)
  File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/urllib.py", line 683, in http_error_401
    errcode, errmsg, headers)
  File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/urllib.py", line 381, in http_error_default
    raise IOError, ('http error', errcode, errmsg, headers)
IOError: ('http error', 401, 'Unauthorized', <httplib.HTTPMessage instance at 0x1066f9a28>)
but Python's urllib doesn't have a handler for the 401 status code and turns that into an exception.
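If you want to see the status code and the body the server sent along with the error, something like the following sketch may help. It assumes Python 3, where urlopen raises urllib.error.HTTPError and that exception can be read like a response. The fetch helper and its opener parameter are names invented for this example.

```python
from urllib.error import HTTPError
from urllib.request import urlopen


def fetch(url, opener=urlopen):
    """Return (status, body), also when the server answers with an HTTP error."""
    try:
        response = opener(url)
    except HTTPError as err:
        # HTTPError doubles as a response object: it carries the status
        # code and the body the server sent with the error.
        return err.code, err.read()
    return response.getcode(), response.read()


# Usage (performs a real request, so it is left commented out):
# status, body = fetch("https://www.fundingcircle.com/my-account/sell-my-loans/")
# print(status)
```

The opener parameter is only there so the function can be exercised without network access; in normal use you would leave it at its default.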
The response body contains a login form; you'll have to write code to log in here, and presumably track cookies.
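With only the standard library, the logging in and cookie tracking could be sketched roughly as below (Python 3 module names). The login URL and the form field names here are hypothetical placeholders; the real ones have to be read from the login form in the response body, and the site may also require hidden fields such as a CSRF token.

```python
from http.cookiejar import CookieJar
from urllib.parse import urlencode
from urllib.request import HTTPCookieProcessor, Request, build_opener

# Hypothetical values (assumptions): take the real form action URL and
# field names from the login form the server returns.
LOGIN_URL = "https://www.example.com/login"
FORM_FIELDS = {"username": "me@example.com", "password": "secret"}


def make_session():
    """Return an opener that stores and resends cookies, plus its cookie jar."""
    jar = CookieJar()
    opener = build_opener(HTTPCookieProcessor(jar))
    return opener, jar


def build_login_request(login_url, fields):
    """Encode the form fields; supplying data makes the request a POST."""
    data = urlencode(fields).encode("ascii")
    return Request(login_url, data=data)


opener, jar = make_session()
login_request = build_login_request(LOGIN_URL, FORM_FIELDS)

# Sending is left commented out because the values above are placeholders:
# opener.open(login_request)   # cookies set by the server end up in jar
# page = opener.open("https://www.fundingcircle.com/my-account/sell-my-loans/").read()
```

Every later call through the same opener sends the stored cookies back, which is what keeps you logged in.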
That task would be a lot easier with more specialised tools. You could use robobrowser to load the page, parse the form and give you the tools to fill it out, then post the form for you and track the cookies required to keep you logged in. It is built on top of the excellent requests and BeautifulSoup libraries.
Upvotes: 5