How does pool.map() allocate work internally?

Question

I am quite new to the multiprocessing library and have a question about its Pool class when used with map(). Suppose I have 4 worker threads and 6 tasks to complete. This is what I do (using multiprocessing.dummy because I want to spawn threads, not processes):

from multiprocessing.dummy import Pool as ThreadPool

def print_it(num):
    print(num)

def multi_threaded():
    tasks = [1, 2, 3, 4, 5, 6]
    pool = ThreadPool(4)
    r = pool.map(print_it, tasks)
    pool.close()
    pool.join()

multi_threaded()

I want to understand how Pool.map() handles the tasks. I can see three options:

Does it spawn 4 threads, complete the first 4 tasks, and let those threads die, then spawn 2 new threads for the remaining tasks?
Does it spawn 4 threads, assign 4 tasks to them, and, as soon as a thread completes its task, assign a new task to that same thread?
Something else?

This insight would help me use Pool.map() more effectively in production.

Hannu · Accepted Answer

It depends on how you define your pool.

As you do it in your example, your option (2) is what happens. Your threads or processes (depending on the kind of Pool) are launched as soon as you initialise the Pool; this happens in Pool.__init__(), so no tasks need to be submitted first. They then sit there waiting for tasks. When a task arrives and has been executed, the thread or process does not exit; it simply goes back to waiting for more work.
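One way to check this behaviour yourself is to record which thread handles each task. The snippet below is a minimal sketch, assuming Python 3; worker_name and run_tasks are hypothetical helper names made up for this illustration, not part of the library.

```python
import threading
from multiprocessing.dummy import Pool as ThreadPool


def worker_name(num):
    # Report which pool thread handled this task.
    return num, threading.current_thread().name


def run_tasks(tasks, workers=4):
    before = threading.active_count()
    pool = ThreadPool(workers)
    # Threads exist before any task is submitted. The pool also starts a few
    # internal helper threads, so this count can be larger than `workers`.
    started = threading.active_count() - before
    try:
        results = pool.map(worker_name, tasks)
    finally:
        pool.close()
        pool.join()
    return started, results


started, results = run_tasks([1, 2, 3, 4, 5, 6])
names = {name for _, name in results}
print("threads started by the pool constructor:", started)
print(len(names), "distinct worker threads handled", len(results), "tasks")
```

With 6 tasks and 4 workers, the number of distinct thread names should never exceed 4, which is consistent with tasks being handed to existing threads rather than to newly spawned ones.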

You can make it work differently, though. You can pass the maxtasksperchild parameter when creating a multiprocessing.Pool (the thread-based Pool from multiprocessing.dummy does not accept it). As soon as a worker has completed that many tasks, it exits, and a new worker is launched immediately (it does not wait for a task to arrive first; it is launched as soon as the old worker exits). This is managed in the Pool class by the Pool._maintain_pool() and Pool._repopulate_pool() functions.
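The effect of maxtasksperchild can be observed by counting the distinct process IDs that end up handling a fixed number of tasks. This is only a sketch under several assumptions: Python 3.4 or later, a process-based Pool rather than the thread pool from the question, and the POSIX-only "fork" start method (chosen here so the snippet runs without a __main__ guard). distinct_worker_pids is a hypothetical helper name, and apply_async is used instead of map() so that each call counts as exactly one task.

```python
import multiprocessing
import os


def distinct_worker_pids(maxtasksperchild, tasks=6, workers=2):
    # Assumption: the "fork" start method is available (POSIX only).
    ctx = multiprocessing.get_context("fork")
    pool = ctx.Pool(workers, maxtasksperchild=maxtasksperchild)
    try:
        # Each apply_async call is one task; os.getpid runs inside a worker.
        pending = [pool.apply_async(os.getpid) for _ in range(tasks)]
        pids = [p.get(timeout=30) for p in pending]
    finally:
        pool.close()
        pool.join()
    return set(pids)


# Default: the same workers are reused for every task.
print("no limit:", len(distinct_worker_pids(None)), "distinct worker processes")
# One task per worker: each worker exits after its task and is replaced.
print("limit 1: ", len(distinct_worker_pids(1)), "distinct worker processes")
```

Without a limit the count should stay at or below the pool size. With maxtasksperchild=1 it should be larger, since each worker is replaced after a single task.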

If you want your workers to launch at start and run indefinitely, do what you do now and this is what happens. If you want your workers to launch at start but exit and renew themselves after a number of tasks (even one if necessary), use maxtasksperchild. If you do not want to launch processes or threads before there is a need for them, do not use Pool. Launch threads or processes when you need them and manage them yourself.
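For the last case, managing threads yourself could look roughly like the following. This is a minimal sketch, assuming Python 3 and one short-lived thread per task; run_on_demand is a hypothetical helper, not a library function, and it deliberately places no cap on the number of concurrent threads.

```python
import threading


def run_on_demand(func, tasks):
    # One slot per task so results come back in task order.
    results = [None] * len(tasks)

    def call(index, task):
        results[index] = func(task)

    threads = []
    for index, task in enumerate(tasks):
        # A thread is created only when there is a task for it,
        # and it ends as soon as that task is done.
        thread = threading.Thread(target=call, args=(index, task))
        thread.start()
        threads.append(thread)

    for thread in threads:
        thread.join()
    return results


print(run_on_demand(lambda x: x * x, [1, 2, 3, 4, 5, 6]))
```

Nothing is running before the first task arrives, which is the trade-off against Pool: no idle workers, but thread start-up cost is paid for every task.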

Hope this helps.
