mike_tech123

Reputation: 105

urllib2 retrieve an arbitrary file based on URL and save it into a named file

I am writing a Python script that uses the urllib2 module as an equivalent to the command-line utility wget. The only functionality I need is to retrieve an arbitrary file based on a URL and save it into a named file. I also only need to worry about two command-line arguments: the URL from which the file is to be downloaded and the name of the file into which the content is to be saved.

Example:

python Prog7.py www.python.org pythonHomePage.html
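
For reference, the two arguments would presumably be read via sys.argv, along these lines (a sketch of the assumed argument handling, not the full program):

import sys

# sys.argv[0] is the script name; the two expected arguments follow
url = sys.argv[1]       # e.g. www.python.org
filename = sys.argv[2]  # e.g. pythonHomePage.html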

This is my code:

import urllib
import urllib2
#import requests

url = 'http://www.python.org/pythonHomePage.html'

print "downloading with urllib"
urllib.urlretrieve(url, "code.txt")

print "downloading with urllib2"
f = urllib2.urlopen(url)
data = f.read()
with open("code2.txt", "wb") as code:
    code.write(data)

urllib seems to work, but urllib2 does not.

Errors received:

 File "Problem7.py", line 11, in <module>
    f = urllib2.urlopen(url)
  File "/usr/lib64/python2.6/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib64/python2.6/urllib2.py", line 397, in open
    response = meth(req, response)
  File "/usr/lib64/python2.6/urllib2.py", line 510, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib64/python2.6/urllib2.py", line 429, in error
    result = self._call_chain(*args)
  File "/usr/lib64/python2.6/urllib2.py", line 369, in _call_chain
    result = func(*args)
  File "/usr/lib64/python2.6/urllib2.py", line 616, in http_error_302
    return self.parent.open(new, timeout=req.timeout)
  File "/usr/lib64/python2.6/urllib2.py", line 397, in open
    response = meth(req, response)
  File "/usr/lib64/python2.6/urllib2.py", line 510, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib64/python2.6/urllib2.py", line 435, in error
    return self._call_chain(*args)
  File "/usr/lib64/python2.6/urllib2.py", line 369, in _call_chain
    result = func(*args)
  File "/usr/lib64/python2.6/urllib2.py", line 518, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 404: NOT FOUND

Upvotes: 3

Views: 1893

Answers (2)

Martijn Pieters

Reputation: 1121584

The URL simply doesn't exist; https://www.python.org/pythonHomePage.html is indeed a 404 Not Found page.

The difference between urllib and urllib2 then is that the latter automatically raises an exception when a 404 page is returned, while urllib.urlretrieve() just saves the error page for you:

>>> import urllib
>>> urllib.urlopen('https://www.python.org/pythonHomePage.html').getcode()
404
>>> import urllib2
>>> urllib2.urlopen('https://www.python.org/pythonHomePage.html')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/urllib2.py", line 127, in urlopen
    return _opener.open(url, data, timeout)
  File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/urllib2.py", line 410, in open
    response = meth(req, response)
  File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/urllib2.py", line 523, in http_response
    'http', request, response, code, msg, hdrs)
  File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/urllib2.py", line 448, in error
    return self._call_chain(*args)
  File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/urllib2.py", line 382, in _call_chain
    result = func(*args)
  File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/urllib2.py", line 531, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 404: NOT FOUND

If you want to save the error page, you can catch the urllib2.HTTPError exception:

try:
    f = urllib2.urlopen(url)
    data = f.read()
except urllib2.HTTPError as err:
    data = err.read()
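
Putting it together with the two command-line arguments from the question, a minimal wget-style sketch might look like this (prefixing http:// is an assumption, since the example invocation omits the scheme and urllib2 requires an absolute URL):

import sys
import urllib2

url, filename = sys.argv[1], sys.argv[2]
# urllib2 needs a scheme; the example invocation omits it
if '://' not in url:
    url = 'http://' + url

try:
    data = urllib2.urlopen(url).read()
except urllib2.HTTPError as err:
    data = err.read()  # keep the error page body, as urllib.urlretrieve() would

with open(filename, 'wb') as out:
    out.write(data)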

Upvotes: 1

Tom

Reputation: 1133

This is due to the differing behavior of urllib and urllib2. Since the web page returns a 404 error (web page not found), urllib2 raises an exception, while urllib downloads the HTML of the returned page regardless of the error. If you want to write that HTML to the text file, you can read it from the error:

import urllib2

try:
    data = urllib2.urlopen('http://www.python.org/pythonHomePage.html').read()
except urllib2.HTTPError as e:
    print e.code     # three-digit status code, e.g. 404
    print e.msg      # user-visible explanation, e.g. NOT FOUND
    print e.headers  # response headers of the error page
    body = e.read()  # read the body once; a second e.fp.read() would return ''
    print body
    with open("code2.txt", "wb") as code:
        code.write(body)

On the caught HTTPError, fp is a file-like object with the HTTP error body (also readable via e.read()), code is the three-digit code of the error, msg is the user-visible explanation of the code, and headers is a mapping object with the headers of the error.

More details about HTTPError: urllib2 documentation

Upvotes: 0
