leonsas

Reputation: 4908

Pool generating infinitely many worker processes

I'm doing some slow computations using networkx (a network analysis library), and I'm trying to use Pool workers to speed them up. The computations are independent, so it should be relatively straightforward to parallelize them.

from multiprocessing import Pool

def computeInfoPooled(G,num_list):
    pool=Pool(processes=4)

    def f(k):
        curr_stat={}
        curr_stat[k]=slow_function(k,G)
        return curr_stat

    result = pool.map(f,num_list)

    return result

Now, I ran the following in the console:

computed_result=computeInfoPooled(G)

I would expect this code to create 4 processes and call f with every item (a number) of num_list in a separate process. If num_list contains more than 4 numbers (in my case about 300), it would run 4 at a time and queue the rest until one of the pooled workers is done.

What happened when I ran my code is that many python.exe processes were being spawned (or forked, I'm not sure which). It seemed to be creating infinitely many processes, so I had to unplug my machine.

Any ideas what I'm doing wrong and how I could fix it?

Upvotes: 2

Views: 241

Answers (2)

abarnert

Reputation: 366133

This is explained in the docs, in the Windows section of the Programming guidelines.

Depending on your platform, each worker process may have to start a brand-new interpreter and import your module just to find the f function it is supposed to call. (On Windows, it always has to do so.) When it imports your module, all top-level code runs, which includes the line computed_result=computeInfoPooled(G); that creates a whole new pool of 4 processes, each of which imports the module again, and so on.

You get around this the same way you deal with any other case where you want the same file to be both importable as a module and runnable as a script:

from multiprocessing import Pool

def computeInfoPooled(G,num_list):
    pool=Pool(processes=4)

    def f(k):
        curr_stat={}
        curr_stat[k]=slow_function(k,G)
        return curr_stat

    result = pool.map(f,num_list)

    return result

if __name__ == '__main__':
    computed_result=computeInfoPooled(G)
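To see why the guard matters, here is a minimal sketch (demo.py and square are hypothetical names, not part of the question): a print at the top level of the script makes the re-import visible, because on Windows every worker re-imports the module and prints its own PID before it can run the mapped function.

# demo.py -- hypothetical sketch of top-level code re-running in the workers
from multiprocessing import Pool
import os

# Top-level code: runs once in the parent, and again in every child
# process that imports this module (on Windows, that is every worker).
print("importing module in process", os.getpid())

def square(x):
    return x * x

if __name__ == '__main__':
    # Guarded code: runs only in the parent, so the workers never try
    # to create pools of their own.
    pool = Pool(processes=2)
    print(pool.map(square, range(4)))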

From your edit and your comments, you seem to be expecting that calling computeInfoPooled(G) from the interactive interpreter will avoid the problem. The same linked docs section explains in detail why that doesn't work, and a big note near the top of the Introduction says so directly:

Functionality within this package requires that the main module be importable by the children. This is covered in Programming guidelines however it is worth pointing out here. This means that some examples, such as the multiprocessing.Pool examples will not work in the interactive interpreter.

If you want to understand why this is true, you need to read the linked docs (and you will also need to understand a little about how import, pickle, and multiprocessing all work).
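A minimal sketch of the pickle part (outer and inner are hypothetical names): pool.map pickles the function it sends to the workers, functions pickle by their module-level name, and so a function defined inside another function cannot be sent at all.

import pickle

def outer():
    def inner():
        pass
    return inner

pickle.dumps(outer)      # fine: a module-level function pickles by its name
# pickle.dumps(outer())  # fails: the nested function has no module-level name to pickle by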

Upvotes: 0

unutbu

Reputation: 880937

On Windows you need

if __name__ == '__main__':
    computed_result = computeInfoPooled(G)

to make your script importable without starting a fork bomb. (Read the section entitled "Safe importing of main module" in the docs.)

Also note that on Windows you may not be able to use the multiprocessing module from the interactive interpreter. See the warning near the top of the docs:

Functionality within this package requires that the main module be importable by the children. This is covered in Programming guidelines however it is worth pointing out here. This means that some examples, such as the multiprocessing.Pool examples will not work in the interactive interpreter. (my emphasis.)

Instead, save the script to a file, e.g. script.py and run it from the command line:

python script.py

In addition, the function you pass to pool.map must be picklable. f needs to be defined at the module level (not inside computeInfoPooled) to be picklable:

from multiprocessing import Pool

# f looks up G as a module-level global, so the workers need it defined at import time
def f(k):
    curr_stat = slow_function(k, G)
    return k, curr_stat


def computeInfoPooled(G, num_list):
    pool = Pool(processes=4)
    result = pool.map(f, num_list)
    return dict(result)

if __name__ == '__main__':
    computed_result = computeInfoPooled(G)

By the way, if f returns a dict, then pool.map(f, ...) will return a list of dicts. I'm not sure that is what you'd want, especially since each dict would only have one key-value pair.

Instead, if you let f return a (key, value) tuple, then pool.map(f, ...) will return a list of tuples, which you could then turn into a dict with dict(result).
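For instance, with purely illustrative values, the conversion looks like this:

# Hypothetical output of pool.map(f, [1, 2, 3]) if f returns (k, slow_function(k, G))
result = [(1, 0.25), (2, 0.5), (3, 0.75)]  # illustrative values only
computed = dict(result)                    # {1: 0.25, 2: 0.5, 3: 0.75}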

Upvotes: 1
