Ginger

Reputation: 8650

Why Can't I load this page using Python?

If I use urllib to load this url( https://www.fundingcircle.com/my-account/sell-my-loans/ ) I get a 400 status error.

e.g. The following returns a 400 error

>>> import urllib
>>> f = urllib.urlopen("https://www.fundingcircle.com/my-account/sell-my-loans/")
>>> print f.read()

However, if I copy and paste the url into my browser, I see a web page with the information that I want to see.

I have tried wrapping the call in a try/except and reading the error, but the returned data just tells me that the page does not exist, e.g.:

import urllib
try:
    f = urllib.urlopen("https://www.fundingcircle.com/my-account/sell-my-loans/")
except Exception as e:
    eString = e.read()
    print eString

Why can't Python load the page?

Upvotes: 1

Views: 170

Answers (1)

Martijn Pieters

Reputation: 1123410

If Python is given a 400 status, that's because the server refused to give you the page.

Exactly why is difficult to know, because servers are black boxes. But your browser gives the server more than just the URL: it also sends a set of HTTP headers. Most likely the server alters its behaviour based on the contents of one or more of those headers.

You need to look in your browser's development tools to see what your browser sends, then try to replicate some of those headers from Python. Obvious candidates are the User-Agent header, followed by the Accept and Cookie headers.
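For illustration, here is a sketch of attaching browser-like headers to a request. It uses Python 3's urllib.request (the question uses Python 2, where urllib2.Request works the same way), and the header values are placeholders — copy the real ones from your own browser's development tools:

```python
from urllib import request

# Hypothetical browser-like headers; copy the actual values from the
# Network tab of your browser's development tools.
req = request.Request(
    "https://www.fundingcircle.com/my-account/sell-my-loans/",
    headers={
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64)",
        "Accept": "text/html,application/xhtml+xml",
    },
)
# request.urlopen(req) would then send these headers with the GET request.
```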

However, in this specific case the server responds with a 401 Unauthorized and serves a login page instead. It does this for both the browser and Python:

>>> import urllib
>>> urllib.urlopen('https://www.fundingcircle.com/my-account/sell-my-loans/')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/urllib.py", line 87, in urlopen
    return opener.open(url)
  File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/urllib.py", line 208, in open
    return getattr(self, name)(url)
  File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/urllib.py", line 451, in open_https
    return self.http_error(url, fp, errcode, errmsg, headers)
  File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/urllib.py", line 372, in http_error
    result = method(url, fp, errcode, errmsg, headers)
  File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/urllib.py", line 683, in http_error_401
    errcode, errmsg, headers)
  File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/urllib.py", line 381, in http_error_default
    raise IOError, ('http error', errcode, errmsg, headers)
IOError: ('http error', 401, 'Unauthorized', <httplib.HTTPMessage instance at 0x1066f9a28>)

but Python's urllib doesn't have a handler for the 401 status code and turns that into an exception.
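As an aside: with urllib2 (or Python 3's urllib.request), that exception is an HTTPError, which doubles as a response object whose body you can still read. A small offline sketch, constructing the error by hand with a fabricated body just to show the behaviour:

```python
import io
import urllib.error

# Build an HTTPError manually to show its response-like interface;
# a real one would be raised by urllib.request.urlopen(). The body
# here is a fake stand-in for the server's login page.
err = urllib.error.HTTPError(
    url="https://www.fundingcircle.com/my-account/sell-my-loans/",
    code=401,
    msg="Unauthorized",
    hdrs=None,
    fp=io.BytesIO(b"<html>...login form...</html>"),
)
body = err.read()  # the page the server sent along with the 401
```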

The response body contains a login form; you'll have to write code to log in here, and presumably track cookies.
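A minimal sketch of that with the standard library (Python 3's http.cookiejar plus urllib.request). The login URL and form field names below are made up for illustration; the real ones have to be read out of the login form itself:

```python
import http.cookiejar
import urllib.parse
import urllib.request

# A CookieJar wired into an opener tracks cookies across requests,
# which is what keeps you logged in.
jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

# Hypothetical form field names; inspect the actual login form for the
# real names and any hidden fields (e.g. a CSRF token).
login_data = urllib.parse.urlencode({
    "email": "you@example.com",
    "password": "secret",
}).encode("ascii")

# opener.open("https://www.fundingcircle.com/login", login_data) would POST
# the form; any session cookie lands in `jar`, and later opener.open() calls
# on the sell-my-loans URL reuse it automatically.
```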

That task would be a lot easier with more specialised tools. You could use robobrowser to load the page, parse the form and give you the tools to fill it out, then post the form for you and track the cookies required to keep you logged in. It is built on top of the excellent requests and BeautifulSoup libraries.

Upvotes: 5
