Reputation: 12134
I have problems with my code.
#!/usr/bin/env python3.1
import urllib.request

# Disguise as a Mozilla browser on a Windows OS
userAgent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
URL = "www.example.com/img"
req = urllib.request.Request(URL, headers={'User-Agent': userAgent})

# Counter for the filename.
i = 0
while True:
    fname = str(i).zfill(3) + '.png'
    req.full_url = URL + fname
    f = open(fname, 'wb')
    try:
        response = urllib.request.urlopen(req)
    except:
        break
    else:
        f.write(response.read())
        i += 1
        response.close()
    finally:
        f.close()
The problem seems to come when I create the urllib.request.Request object (called req). I create it with a non-existent URL and later change the URL to what it should be. I'm doing this so that I can reuse the same urllib.request.Request object instead of creating a new one on each iteration. There is probably a mechanism for doing exactly that in Python, but I'm not sure what it is.
EDIT Error message is:
>>> response = urllib.request.urlopen(req);
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3.1/urllib/request.py", line 121, in urlopen
return _opener.open(url, data, timeout)
File "/usr/lib/python3.1/urllib/request.py", line 356, in open
response = meth(req, response)
File "/usr/lib/python3.1/urllib/request.py", line 468, in http_response
'http', request, response, code, msg, hdrs)
File "/usr/lib/python3.1/urllib/request.py", line 394, in error
return self._call_chain(*args)
File "/usr/lib/python3.1/urllib/request.py", line 328, in _call_chain
result = func(*args)
File "/usr/lib/python3.1/urllib/request.py", line 476, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
EDIT 2: My solution is the following. Probably should have done this at the start as I knew it would work:
import urllib.request

# Disguise as a Mozilla browser on a Windows OS
userAgent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'

# Counter for the filename.
i = 0
while True:
    fname = str(i).zfill(3) + '.png'
    URL = "www.example.com/img" + fname
    f = open(fname, 'wb')
    try:
        req = urllib.request.Request(URL, headers={'User-Agent': userAgent})
        response = urllib.request.urlopen(req)
    except:
        break
    else:
        f.write(response.read())
        i += 1
        response.close()
    finally:
        f.close()
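The same loop can also be wrapped in a small function that stops cleanly on the first failed request and never leaves an empty file behind. This is only a sketch: the function name is illustrative, and urlopen requires the base URL to carry an explicit http:// scheme.

```python
import urllib.error
import urllib.request

def download_images(base_url, user_agent):
    """Fetch base_url + '000.png', '001.png', ... until a request fails.

    Illustrative sketch of the loop above; the name is not part of
    the original code.
    """
    i = 0
    while True:
        fname = str(i).zfill(3) + '.png'
        req = urllib.request.Request(base_url + fname,
                                     headers={'User-Agent': user_agent})
        try:
            response = urllib.request.urlopen(req)
        except urllib.error.URLError:
            # Stop at the first missing image instead of swallowing
            # every possible error with a bare except.
            break
        # Open the file only after the request succeeds, so no empty
        # file is left over when the loop ends.
        with open(fname, 'wb') as f:
            f.write(response.read())
        response.close()
        i += 1
    return i
```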
Upvotes: 1
Views: 8147
Reputation: 133
If you want to use the custom user agent with every request, you can subclass FancyURLopener.
Here's an example: http://wolfprojects.altervista.org/changeua.php
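A minimal sketch of that approach (the class name is mine, and note that FancyURLopener has been deprecated since Python 3.3): the opener sends its class-level version attribute as the User-Agent header.

```python
import urllib.request

class MozillaOpener(urllib.request.FancyURLopener):
    # FancyURLopener sends self.version as the User-Agent header
    version = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'

opener = MozillaOpener()
# opener.open('http://www.example.com/img000.png') would now send
# the custom User-Agent with every request made through this opener.
```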
Upvotes: 0
Reputation: 156158
urllib2 is fine for small scripts that only need to do one or two network interactions, but if you are doing a lot more work, you will likely find that either urllib3 or requests (which, not coincidentally, is built on the former) may suit your needs better. Your particular example might look like:
from itertools import count

import requests

HEADERS = {'user-agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
URL = "http://www.example.com/img%03d.png"

# with a session, we get keep-alive
session = requests.Session()

for n in count():
    full_url = URL % n
    ignored, filename = full_url.rsplit('/', 1)
    with open(filename, 'wb') as outfile:
        response = session.get(full_url, headers=HEADERS)
        if not response.ok:
            break
        outfile.write(response.content)
Edit: If you can use regular HTTP authentication (which the 403 Forbidden response strongly suggests), then you can add it to a requests.get call with the auth parameter, as in:
response = session.get(full_url, headers=HEADERS, auth=('username', 'password'))
Upvotes: 5
Reputation: 11381
Don't break when you receive an exception. Change
except:
    break
to
except:
    # Probably should log some debug information here.
    pass
This will skip any problematic request, so that one failure doesn't bring down the whole process.
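One caveat: the loop above only terminates via that break, so replacing it with pass makes the loop run forever. A middle ground (a sketch; the helper name and status-code choices are my assumptions) is to inspect the HTTPError code and treat only "not found" as the end of the image sequence, skipping other failures:

```python
import urllib.error

def should_continue(err, n):
    """Decide whether to keep looping after a failed download (illustrative)."""
    if isinstance(err, urllib.error.HTTPError) and err.code == 404:
        return False  # assume 404 means the image sequence has ended
    print('skipping image %03d: %s' % (n, err))  # log other errors and move on
    return True
```

In the question's loop, the except clause would call this helper and break only when it returns False.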
Upvotes: -2