Framester

Reputation: 35471

Downloading hundreds of files using `requests` stalls in the middle.

My code for downloading files from URLs with requests stalls for no apparent reason. When I start the script it downloads several hundred files, but then it just stops somewhere. If I try the URL manually in the browser, the image loads without a problem. I also tried urllib.urlretrieve, but had the same problem. I am using Python 2.7.5 on OS X.

Here is my code:

import requests

def download_from_url(url, download_path):
    with open(download_path, 'wb') as handle:
        response = requests.get(url, stream=True)
        for block in response.iter_content(1024):
            if not block:
                break
            handle.write(block)

def download_photos_from_urls(urls, concept):
    ensure_path_exists(concept)
    bad_results = list()
    for i, url in enumerate(urls):
        print i, url,
        download_path = concept+'/'+url.split('/')[-1]
        try:
            download_from_url(url, download_path)
            print
        except IOError as e:
            print str(e)
    return bad_results

Output of dtruss attached to the stalled process:

My-desk:~ Me$ sudo dtruss -p 708
SYSCALL(args) = return

Traceback after interrupting the script with Ctrl-C:

318 http://farm1.static.flickr.com/32/47394454_10e6d7fd6d.jpg
Traceback (most recent call last):
  File "slow_download.py", line 71, in <module>
    if final_path == '':
  File "slow_download.py", line 34, in download_photos_from_urls
    download_path = concept+'/'+url.split('/')[-1]
  File "slow_download.py", line 21, in download_from_url
    with open(download_path, 'wb') as handle:
  File "/Library/Python/2.7/site-packages/requests/models.py", line 638, in generate
    for chunk in self.raw.stream(chunk_size, decode_content=True):
  File "/Library/Python/2.7/site-packages/requests/packages/urllib3/response.py", line 256, in stream
    data = self.read(amt=amt, decode_content=decode_content)
  File "/Library/Python/2.7/site-packages/requests/packages/urllib3/response.py", line 186, in read
    data = self._fp.read(amt)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 567, in read
    s = self.fp.read(amt)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/socket.py", line 380, in read
    data = self._sock.recv(left)
KeyboardInterrupt

Upvotes: 2

Views: 1740

Answers (2)

MattDMo

Reputation: 102862

So, just to unify all the comments and propose a potential solution: there are a couple of reasons why your downloads are failing after a few hundred. It may be something internal to Python, such as hitting the maximum number of open file handles, or it may be the server blocking you for being a robot.
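
If you want to rule out the file-handle theory quickly, here is a minimal sketch (using only the standard resource module) that prints the per-process limit; on OS X the soft limit is often only a few hundred, so leaking one handle per download would exhaust it quickly:

import resource

# Soft/hard limits on open file descriptors for this process (RLIMIT_NOFILE).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print('open file limit: soft=%d, hard=%d' % (soft, hard))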

You didn't share all of your code, so it's a bit difficult to say, but at least with what you've shown you're using the with context manager when opening the files to write to, so you shouldn't run into problems there. There's the possibility that the request objects are not getting closed properly after exiting the loop, but I'll show you how to deal with that below.
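
If you want a stopgap on that front while keeping stream=True, wrapping the streamed response in contextlib.closing guarantees it gets released even if the loop raises; a minimal sketch (the timeout value is my own assumption, not something from your code):

import requests
from contextlib import closing

def download_from_url(url, download_path):
    with open(download_path, 'wb') as handle:
        # closing() guarantees response.close() runs even if iter_content raises,
        # so the connection is never left open. The timeout is an assumption so a
        # dead connection raises an exception instead of hanging forever.
        with closing(requests.get(url, stream=True, timeout=60)) as response:
            for block in response.iter_content(1024):
                if not block:
                    break
                handle.write(block)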

The default requests User-Agent is (on my machine):

python-requests/2.4.1 CPython/3.4.1 Windows/8

so it's not inconceivable that the server(s) you're requesting from screen for UAs like this and limit their number of connections. The reason you were also able to get the code to work with urllib.urlretrieve is that its UA is different from requests', so the server allowed it to continue for approximately the same number of requests, then shut it down, too.
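
You can check exactly what your own install sends with a quick sketch:

import requests

# Print the default User-Agent header this requests install will send.
print(requests.Session().headers['User-Agent'])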

To get around these issues, I suggest altering your download_from_url() function to something like this:

import requests
from time import sleep

def download_from_url(url, download_path, delay=5):
    headers = {'Accept-Encoding': 'identity, deflate, compress, gzip', 
               'Accept': '*/*',
               'Connection': 'keep-alive',
               'User-Agent': 'Mozilla/5.0 (Windows NT 6.2; WOW64; rv:28.0) Gecko/20100101 Firefox/28.0'}
    with open(download_path, 'wb') as handle:
        response = requests.get(url, headers=headers) # no stream=True, that could be an issue
        handle.write(response.content)
        response.close()
        sleep(delay)

Instead of using stream=True, we use the default value of False to download the full content of the request immediately. The headers dict contains a few default values, as well as the all-important 'User-Agent' value, which in this example happens to be my UA, determined by using What'sMyUserAgent. Feel free to change this to the one returned by your preferred browser.

Instead of messing around with iterating through the content in 1KB blocks, I just write the entire content to disk at once, eliminating extraneous code and some potential sources of error: for example, if there were a hiccup in your network connectivity, you could temporarily get empty blocks and break out of the loop prematurely. I also explicitly close the request, just in case.

Finally, I added an extra parameter to your function, delay, which makes the function sleep for a given number of seconds before returning. I gave it a default value of 5; you can make it whatever you want (it also accepts floats for fractional seconds).
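
Hooked into your existing loop, usage might look like this (just a sketch; the 2-second delay is an arbitrary example):

for i, url in enumerate(urls):
    print i, url
    download_path = concept + '/' + url.split('/')[-1]
    # pause 2 seconds between downloads; tune this to whatever the server tolerates
    download_from_url(url, download_path, delay=2)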

I don't happen to have a large list of image URLs lying around to test this, but it should work as expected. Good luck!

Upvotes: 2

Wolph

Reputation: 80031

Perhaps the lack of connection pooling is causing too many connections. Try something like this (using a session so connections are reused):

import requests

session = requests.Session()

def download_from_url(url, download_path):
    with open(download_path, 'wb') as handle:
        response = session.get(url, stream=True)
        for block in response.iter_content(1024):
            if not block:
                break
            handle.write(block)

def download_photos_from_urls(urls, concept):
    ensure_path_exists(concept)
    bad_results = list()
    for i, url in enumerate(urls):
        print i, url,
        download_path = concept+'/'+url.split('/')[-1]
        try:
            download_from_url(url, download_path)
            print
        except IOError as e:
            print str(e)
    return bad_results
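
If you want finer control over the pool (or automatic retries on flaky connections), you can also mount an HTTPAdapter on the session; a minimal sketch, with the pool sizes and retry count being arbitrary example values (the same idea applies to the session above):

import requests
from requests.adapters import HTTPAdapter

session = requests.Session()
# Example values: a slightly larger connection pool plus a few retries.
adapter = HTTPAdapter(pool_connections=10, pool_maxsize=10, max_retries=3)
session.mount('http://', adapter)
session.mount('https://', adapter)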

Upvotes: 1
