Reputation: 35471
My code for downloading files from URLs using requests stalls for no apparent reason. When I start the script it downloads several hundred files, but then it just stops somewhere. If I try the URL manually in the browser, the image loads without a problem. I also tried urllib.urlretrieve, but had the same problem. I use Python 2.7.5 on OS X.
Below you will find the code, the output of dtruss captured while the program was stalling, and the traceback after I ctrl-c'd the process when nothing had happened for 10 minutes.
Code:
def download_from_url(url, download_path):
    with open(download_path, 'wb') as handle:
        response = requests.get(url, stream=True)
        for block in response.iter_content(1024):
            if not block:
                break
            handle.write(block)

def download_photos_from_urls(urls, concept):
    ensure_path_exists(concept)
    bad_results = list()
    for i, url in enumerate(urls):
        print i, url,
        download_path = concept + '/' + url.split('/')[-1]
        try:
            download_from_url(url, download_path)
            print
        except IOError as e:
            print str(e)
    return bad_results
dtruss output:
My-desk:~ Me$ sudo dtruss -p 708
SYSCALL(args) = return
Traceback:
318 http://farm1.static.flickr.com/32/47394454_10e6d7fd6d.jpg
Traceback (most recent call last):
  File "slow_download.py", line 71, in <module>
    if final_path == '':
  File "slow_download.py", line 34, in download_photos_from_urls
    download_path = concept+'/'+url.split('/')[-1]
  File "slow_download.py", line 21, in download_from_url
    with open(download_path, 'wb') as handle:
  File "/Library/Python/2.7/site-packages/requests/models.py", line 638, in generate
    for chunk in self.raw.stream(chunk_size, decode_content=True):
  File "/Library/Python/2.7/site-packages/requests/packages/urllib3/response.py", line 256, in stream
    data = self.read(amt=amt, decode_content=decode_content)
  File "/Library/Python/2.7/site-packages/requests/packages/urllib3/response.py", line 186, in read
    data = self._fp.read(amt)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 567, in read
    s = self.fp.read(amt)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/socket.py", line 380, in read
    data = self._sock.recv(left)
KeyboardInterrupt
Upvotes: 2
Views: 1740
Reputation: 102862
So, just to unify all the comments and propose a potential solution: there are a couple of reasons why your downloads might be failing after a few hundred. It may be something internal to Python, such as hitting the maximum number of open file handles, or it may be the server blocking you for being a robot.
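If you want to quickly rule out the file-handle theory, the standard library's resource module can report the per-process descriptor limit (this is just a diagnostic sketch, not part of the fix):
import resource

# Per-process file descriptor limits; on OS X the soft limit is often
# only 256, which a loop that leaks handles can exhaust quickly.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print 'fd limit: soft=%d, hard=%d' % (soft, hard)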
You didn't share all of your code, so it's a bit difficult to say, but at least from what you've shown you're using the with context manager when opening the files to write to, so you shouldn't run into problems there. There's the possibility that the request objects are not getting closed properly after exiting the loop, but I'll show you how to deal with that below.
The default requests User-Agent is (on my machine):
python-requests/2.4.1 CPython/3.4.1 Windows/8
so it's not too inconceivable to imagine the server(s) you're requesting from are screening for various UAs like this and limiting their number of connections. The reason you were also able to get the code to work with urllib.urlretrieve was that its UA is different from requests', so the server allowed it to continue for approximately the same number of requests, then shut it down, too.
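If you want to check what each library actually identifies itself as, both defaults are easy to print from a Python 2 shell (requests.utils.default_user_agent() is a real helper in requests; treat the exact strings as version-dependent):
import requests
import urllib

# Default UA sent by requests, e.g. python-requests/2.x CPython/2.7.5 Darwin/...
print requests.utils.default_user_agent()

# Default UA sent by urllib/urlretrieve in Python 2, e.g. Python-urllib/1.17
print urllib.URLopener().version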
To get around these issues, I suggest altering your download_from_url() function to something like this:
import requests
from time import sleep

def download_from_url(url, download_path, delay=5):
    headers = {'Accept-Encoding': 'identity, deflate, compress, gzip',
               'Accept': '*/*',
               'Connection': 'keep-alive',
               'User-Agent': 'Mozilla/5.0 (Windows NT 6.2; WOW64; rv:28.0) Gecko/20100101 Firefox/28.0'}
    with open(download_path, 'wb') as handle:
        response = requests.get(url, headers=headers)  # no stream=True, that could be an issue
        handle.write(response.content)
        response.close()
    sleep(delay)
Instead of using stream=True, we use the default value of False to immediately download the full content of the request. The headers dict contains a few default values, as well as the all-important 'User-Agent' value, which in this example happens to be my UA, determined by using What'sMyUserAgent. Feel free to change it to the one returned by your preferred browser. Instead of messing around with iterating through the content in 1KB blocks, here I just write the entire content to disk at once, eliminating extraneous code and some potential sources of error - for example, if there were a hiccup in your network connectivity, you could temporarily get empty blocks and break out early in error. I also explicitly close the request, just in case. Finally, I added an extra parameter to your function, delay, to make the function sleep for a certain number of seconds before returning. I gave it a default value of 5; you can make it whatever you want (it also accepts floats for fractional seconds).
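As a quick illustration of how the new signature fits into your existing loop (urls, concept and ensure_path_exists come from your original code, and the 1.5-second delay is just an example value), the driver could look something like this, now also recording the failures it was meant to return:
import requests

def download_photos_from_urls(urls, concept):
    ensure_path_exists(concept)
    bad_results = list()
    for i, url in enumerate(urls):
        print i, url
        download_path = concept + '/' + url.split('/')[-1]
        try:
            # delay=1.5 is an arbitrary example; tune it to what the server tolerates
            download_from_url(url, download_path, delay=1.5)
        except (IOError, requests.exceptions.RequestException) as e:
            bad_results.append((url, str(e)))
            print str(e)
    return bad_results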
I don't happen to have a large list of image URLs lying around to test this, but it should work as expected. Good luck!
Upvotes: 2
Reputation: 80031
Perhaps the lack of connection pooling is causing too many connections. Try something like this (using a session):
import requests

session = requests.Session()

def download_from_url(url, download_path):
    with open(download_path, 'wb') as handle:
        response = session.get(url, stream=True)
        for block in response.iter_content(1024):
            if not block:
                break
            handle.write(block)

def download_photos_from_urls(urls, concept):
    ensure_path_exists(concept)
    bad_results = list()
    for i, url in enumerate(urls):
        print i, url,
        download_path = concept + '/' + url.split('/')[-1]
        try:
            download_from_url(url, download_path)
            print
        except IOError as e:
            print str(e)
    return bad_results
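If the default pool turns out to be too small, the HTTPAdapter behind the session can be tuned, and it is also worth passing an explicit timeout, since requests will otherwise wait on a dead socket forever (the pool sizes, the timeout and the URL below are arbitrary example values):
import requests
from requests.adapters import HTTPAdapter

session = requests.Session()

# pool_connections = number of host pools to cache, pool_maxsize = connections
# kept per pool; 20/20 are just example values.
adapter = HTTPAdapter(pool_connections=20, pool_maxsize=20)
session.mount('http://', adapter)
session.mount('https://', adapter)

# requests has no default timeout, so a stalled read blocks indefinitely;
# with timeout set, the stall surfaces as an exception you can catch instead.
response = session.get('http://example.com/some_image.jpg', stream=True, timeout=10)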
Upvotes: 1