mike_tech123

Reputation: 105

urllib2 retrieve an arbitrary file based on URL and save it into a named file

I am writing a Python script that uses the urllib2 module as an equivalent to the command-line utility wget. The only functionality I need is to retrieve an arbitrary file based on a URL and save it into a named file. I also only need to worry about two command-line arguments: the URL from which the file is to be downloaded and the name of the file into which the content is to be saved.

Example:

python Prog7.py www.python.org pythonHomePage.html
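
For reference, the two arguments would presumably be read via sys.argv, along these lines (a sketch of the assumed argument handling, not the full program):

import sys

# sys.argv[0] is the script name; the two expected arguments follow
url = sys.argv[1]       # e.g. www.python.org
filename = sys.argv[2]  # e.g. pythonHomePage.html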

This is my code:

import urllib
import urllib2
#import requests

url = 'http://www.python.org/pythonHomePage.html'

print "downloading with urllib"
urllib.urlretrieve(url, "code.txt")

print "downloading with urllib2"
f = urllib2.urlopen(url)
data = f.read()
with open("code2.txt", "wb") as code:
    code.write(data)

urllib seems to work, but urllib2 does not.

Errors received:

 File "Problem7.py", line 11, in <module>
    f = urllib2.urlopen(url)
  File "/usr/lib64/python2.6/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib64/python2.6/urllib2.py", line 397, in open
    response = meth(req, response)
  File "/usr/lib64/python2.6/urllib2.py", line 510, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib64/python2.6/urllib2.py", line 429, in error
    result = self._call_chain(*args)
  File "/usr/lib64/python2.6/urllib2.py", line 369, in _call_chain
    result = func(*args)
  File "/usr/lib64/python2.6/urllib2.py", line 616, in http_error_302
    return self.parent.open(new, timeout=req.timeout)
  File "/usr/lib64/python2.6/urllib2.py", line 397, in open
    response = meth(req, response)
  File "/usr/lib64/python2.6/urllib2.py", line 510, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib64/python2.6/urllib2.py", line 435, in error
    return self._call_chain(*args)
  File "/usr/lib64/python2.6/urllib2.py", line 369, in _call_chain
    result = func(*args)
  File "/usr/lib64/python2.6/urllib2.py", line 518, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 404: NOT FOUND

Upvotes: 3

Views: 1893

Answers (2)

Martijn Pieters

Reputation: 1121584

The URL simply doesn't exist; https://www.python.org/pythonHomePage.html is indeed a 404 Not Found page.

The difference between urllib and urllib2 then is that the latter automatically raises an exception when a 404 page is returned, while urllib.urlretrieve() just saves the error page for you:

>>> import urllib
>>> urllib.urlopen('https://www.python.org/pythonHomePage.html').getcode()
404
>>> import urllib2
>>> urllib2.urlopen('https://www.python.org/pythonHomePage.html')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/urllib2.py", line 127, in urlopen
    return _opener.open(url, data, timeout)
  File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/urllib2.py", line 410, in open
    response = meth(req, response)
  File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/urllib2.py", line 523, in http_response
    'http', request, response, code, msg, hdrs)
  File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/urllib2.py", line 448, in error
    return self._call_chain(*args)
  File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/urllib2.py", line 382, in _call_chain
    result = func(*args)
  File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/urllib2.py", line 531, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 404: NOT FOUND

If you want to save the error page, you can catch the urllib2.HTTPError exception:

try:
    f = urllib2.urlopen(url)
    data = f.read()
except urllib2.HTTPError as err:
    data = err.read()
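
Putting it together with the two command-line arguments from the question, a minimal wget-style sketch might look like this (prefixing http:// is an assumption, since the example invocation omits the scheme and urllib2 requires an absolute URL):

import sys
import urllib2

url, filename = sys.argv[1], sys.argv[2]
# urllib2 needs a scheme; the example invocation omits it
if '://' not in url:
    url = 'http://' + url

try:
    data = urllib2.urlopen(url).read()
except urllib2.HTTPError as err:
    data = err.read()  # keep the error page body, as urllib.urlretrieve() would

with open(filename, 'wb') as out:
    out.write(data)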

Upvotes: 1

Tom

Reputation: 1133

This is due to the differing behavior of urllib and urllib2. Since the web page returns a 404 error (web page not found), urllib2 raises an exception, while urllib downloads the HTML of the returned page regardless of the error. If you want to write that HTML to the text file, you can read it from the error:

import urllib2

try:
    data = urllib2.urlopen('http://www.python.org/pythonHomePage.html').read()
except urllib2.HTTPError as e:
    print e.code     # three-digit status code, e.g. 404
    print e.msg      # user-visible explanation, e.g. NOT FOUND
    print e.headers  # response headers of the error page
    body = e.read()  # read the body once; a second e.fp.read() would return ''
    print body
    with open("code2.txt", "wb") as code:
        code.write(body)

On the caught HTTPError, fp is a file-like object with the HTTP error body (also readable via e.read()), code is the three-digit code of the error, msg is the user-visible explanation of the code, and headers is a mapping object with the headers of the error.

More details about HTTPError: urllib2 documentation

Upvotes: 0
