Reputation: 2171
I have just studied this part of the documentation.
So, as far as I understood it, this function

import multiprocessing

def f(x):
    return x * x

pool = multiprocessing.Pool()
print(pool.map(f, range(10)))

would split the tasks into chunks, with the chunk size equal to the number of cores. And the results will be in the same order as the input sequence.
The docs also say that map() will block until complete.

Let's imagine f above is a complex function. We have 4 CPUs and therefore a chunk size of 4. Does it block until all 4 have finished, and only then get the next chunk?

So in the worst case, 3 free cores would idle for a long time until the last one finishes?
Upvotes: 2
Views: 165
Reputation: 155684
You seem to be under the impression that the chunksize will match the number of cores. This is not correct. When not specified, chunksize has an implementation-defined value, and it's not equal to the number of cores, at least on CPython (the reference interpreter). At time of writing, on both Python 2.7 and 3.7, the computation used is:

if chunksize is None:
    chunksize, extra = divmod(len(iterable), len(self._pool) * 4)
    if extra:
        chunksize += 1

len(self._pool) is the number of worker processes, and len(iterable) is the number of items in the input iterable (which is list-ified if it didn't have a defined length).
So for your case, the calculation is:
chunksize, extra = divmod(10, numcores * 4)
if extra:
    chunksize += 1
which for a four-core machine (for example) would compute chunksize, extra = 0, 10, and the if check would then change chunksize to 1. So each worker would take a single input value (0, 1, 2 and 3 would be grabbed almost immediately), then as each worker finished, it would grab one more item. Assuming all items take roughly the same amount of time, you'd do two rounds with full occupancy (4/4 cores in use), then one round with half occupancy (2/4 cores in use). Your worst-case scenario is that the last task to begin takes the longest to run. If this is knowable ahead of time, you should try to organize your inputs to prevent that (placing the most expensive items first, so the final tasks run with incomplete occupancy are short and finish fast, maximizing parallelism); otherwise, it's pretty unavoidable.
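To see the computation above in action, here is a small sketch that mirrors CPython's default chunksize formula (the helper name default_chunksize is my own, not part of the stdlib):

```python
def default_chunksize(n_items, n_workers):
    # Mirrors CPython's Pool.map default: divide the input into roughly
    # 4 chunks per worker, rounding up when there is a remainder.
    chunksize, extra = divmod(n_items, n_workers * 4)
    if extra:
        chunksize += 1
    return chunksize

print(default_chunksize(10, 4))   # 1  (the questioner's case)
print(default_chunksize(100, 4))  # 7
```
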
For a larger number of tasks, yes, the default chunksize will increase, e.g. for 100 inputs on four cores, you'd have a chunksize of 7, producing 15 chunks, the last of which is undersized. So yes, for tasks with wildly varying runtimes, you'd risk a long tail with low occupancy. If that's a risk, explicitly set your chunksize to 1; it reduces overall performance (bringing it closer to that of imap), but it removes the possibility of one worker working on item 1 of 7 in a chunk with all the other cores sitting idle.
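As a concrete sketch of that advice (slow_square is a stand-in function of my own, not from the docs):

```python
import multiprocessing

def slow_square(x):
    # Stand-in for a task with wildly varying runtime.
    return x * x

if __name__ == '__main__':
    with multiprocessing.Pool(4) as pool:
        # chunksize=1 hands out one item at a time, so a slow item never
        # traps other queued items behind it in the same chunk.
        results = pool.map(slow_square, range(10), chunksize=1)
    print(results)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```

Results still come back in input order; only the scheduling granularity changes.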
Upvotes: 3
Reputation: 4765
You are partially right.

You can also read that map accepts a chunksize parameter that can be used to tune the size of task chunks submitted to pool processes. If the chunks are small enough, each process should be fed fairly equally and all cores will be working most of the time.
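To illustrate the trade-off (a sketch of my own, not from the docs): smaller chunks spread work more evenly across workers, but produce more chunks and hence more inter-process overhead.

```python
import math

# Number of chunks pool.map would submit for 100 inputs at various
# chunksize values: smaller chunks balance better, cost more IPC.
for chunksize in (1, 2, 5, 25):
    n_chunks = math.ceil(100 / chunksize)
    print(f"chunksize={chunksize}: {n_chunks} chunks")
```
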
Upvotes: 0