Reputation: 25
So, maybe start from my code:
def download(fn, filename, index):
    urllib.request.urlretrieve(fn,
        os.path.join('music', re.sub('[%s]' % ''.join(CHAR_NOTALLOWED), '', filename) + '.mp3'))
    print(str(index) + '# DOWNLOADED: ' + filename)
and
for index, d in enumerate(found):
    worker = Thread(target=download, args=(found[d], d, index))
    worker.setDaemon(True)
    worker.start()
worker.join()
My problem is that when I try to download more than 1000 files, I always get this error, but I don't know why:
Traceback (most recent call last):
File "E:/PythonProject/1.1/mp3y.py", line 238, in <module>
worker.start()
File "E:\python34\lib\threading.py", line 851, in start
_start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread
I tried using a queue, but got the same error... I wanted to split this work across a limited number of threads, but I don't know how :O
Upvotes: 1
Views: 2786
Reputation: 365707
Short version:
with concurrent.futures.ThreadPoolExecutor(max_workers=12) as executor:
    for index, d in enumerate(found):
        executor.submit(download, found[d], d, index)
That's it: a trivial change, two lines shorter than your existing code, and you're done.
So, what's wrong with your existing code? Starting 1000 threads at a time is always a bad idea.* Once you get beyond a few dozen, you're adding more scheduler and context-switching overhead than you are concurrency savings.
If you want to know why it fails right around 1000, that could be because of a library working around older versions of Windows,** or it could be because you're running out of stack space.*** But either way, it doesn't really matter. The right solution is to not use so many threads.
The usual solution is to use a thread pool: start about 8-12 threads,**** and have them pull the URLs to download off a queue. You can build this yourself, or you can use the concurrent.futures.ThreadPoolExecutor or multiprocessing.dummy.Pool that come with the stdlib. If you look at the main ThreadPoolExecutor example in the docs, it's doing almost exactly what you want. In fact, what you want is even simpler, because you don't care about the results.
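If you'd rather use multiprocessing.dummy.Pool, the equivalent is roughly this (a sketch reusing the download function and found dict from your question; the pool size is just the rule of thumb above):
import download stays as-is; only the driver loop changes:
from multiprocessing.dummy import Pool  # thread-backed Pool with the multiprocessing.Pool API

# One (url, name, index) job per download, reusing the question's found dict.
jobs = [(found[d], d, index) for index, d in enumerate(found)]

with Pool(12) as pool:
    # starmap unpacks each tuple into download(url, name, index)
    # and blocks until every job has finished.
    pool.starmap(download, jobs)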
As a side note, you've got another serious problem in your code. If you daemonize your threads, joining them largely defeats the purpose of daemonizing them. Also, you're only trying to join the last one you created, which is by no means guaranteed to be the last one to finish. And daemonizing download threads is probably a bad idea in the first place, because when your main thread finishes (after waiting for one arbitrarily-chosen download to finish), the others may get interrupted and leave partial files behind.
Also, if you do want to daemonize a thread, the best way is to pass daemon=True to the constructor. If you need to do it after creation, just do t.daemon = True. Only call the deprecated setDaemon method if you need backward compatibility to Python 2.5.
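For example (a minimal sketch; url and name stand in for one entry from your found dict):
from threading import Thread

# Preferred since Python 3.3: daemonize in the constructor.
t = Thread(target=download, args=(url, name, 0), daemon=True)

# Or, if you have to do it after creation:
t.daemon = True

t.start()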
* I guess I shouldn't say always, because in 2025 it'll probably be an everyday thing to do, to take advantage of your thousands of slow cores. But in 2014 on normal laptop/desktop/server hardware, it's always bad.
** Older versions of Windows (at least NT 4) had all kinds of bizarre errors when you got close to 1024 threads, so many threading libraries just refuse to create more than 1000 threads. Although that doesn't seem to be the case here, as Python is just calling Microsoft's own wrapper function _beginthreadex, which doesn't do that.
*** By default, each thread gets 1MB of stack space, and in 32-bit apps there's a maximum total stack space, which I'd assume defaults to 1GB on your version of Windows; at 1MB apiece, 1000 threads is right at that limit. You can customize either the stack space for each thread or the total process stack space, but Python doesn't customize either, nor do almost any other apps.
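(To be clear, CPython does expose a knob for the per-thread part: threading.stack_size, which applies to threads created after the call. Shrinking it is a workaround rather than a fix, but here's a sketch:)
import threading

# Applies to threads created after this call; the size must be 0
# or at least 32768 bytes, and some platforms round it up further.
threading.stack_size(256 * 1024)  # 256KB instead of the 1MB default

t = threading.Thread(target=print, args=('hello',))
t.start()
t.join()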
**** Unless your downloads are all coming off the same server, in which case you probably want at most 4, and really more than 2 is usually considered impolite if it's not your server. And why 8-12 anyway? It was a rule of thumb that tested well a long time ago. It's probably not optimal anymore, but it's probably close enough for most uses. If you really need to squeeze out a bit more performance, you can test with different numbers.
Upvotes: 4
Reputation: 1773
There is usually a limit on the maximum number of threads allowed. Depending on your system, this might be anywhere from a few dozen to thousands, but considering the number of files you intend to download, don't expect to be able to create one thread per file.
It is generally not a good idea to start 1000+ threads simultaneously, each trying to download a file. Your connection will clog in no time; it's much less efficient than downloading a couple of files at a time, and apart from that, it wastes a lot of server resources and so isn't considered very sociable.
The pattern used in a situation like this is to create a small number of worker threads which each poll a queue.Queue for files to download, download a file, then poll the queue for the next one. The main program can then feed this queue from the original list, scheduling files for download until all downloads are complete.
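A minimal sketch of that pattern, using queue.Queue's task_done/join handshake and a made-up urls list (the answer below shows a sentinel-based variant of the same idea):
import queue
from threading import Thread

NWORKERS = 8
q = queue.Queue()

def worker():
    while True:
        url = q.get()                  # blocks until a file is scheduled
        try:
            print('downloading', url)  # replace with the real download call
        finally:
            q.task_done()              # lets q.join() track completion

for _ in range(NWORKERS):
    Thread(target=worker, daemon=True).start()

# The main program feeds the queue from the original list.
for url in ['http://example.com/%d.mp3' % i for i in range(1000)]:
    q.put(url)

q.join()  # returns once every queued file has been processed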
A notable exception to this rule is if you are downloading files from a site which artificially throttles download speed. Video portals in particular are known for doing so. In this case, it may be appropriate to use a significantly higher number of threads. In one case, when downloading from dailymotion, I found that 20-30 threads worked best for me.
Upvotes: 4
Reputation: 52039
Using a queue will work, but you have to limit the number of worker threads you create. Here is code which uses 100 workers and a queue.Queue to process 1000 items of work:
import queue
from threading import Thread

def main():
    nworkers = 100
    q = queue.Queue(1000 + nworkers)
    # add the work
    for i in range(1000):
        q.put(i)
    # add the stop signals
    for i in range(nworkers):
        q.put(-1)
    # create and start up the threads
    workers = []
    for wid in range(nworkers):
        w = Thread(target=dowork, args=(q, wid))
        w.start()
        workers.append(w)
    # join all of the workers
    for w in workers:
        w.join()
    print("All done!")

def dowork(q, wid):
    while True:
        j = q.get()
        if j < 0:
            break
        else:
            print("Worker", wid, "processing item", j)
    print("Worker", wid, "exiting")

if __name__ == "__main__":
    main()
Upvotes: 0