leonsas

Reputation: 4908

Pool generating infinitely many worker processes

I'm doing some slow computations using networkx (a network analysis library), and I'm trying to use Pool workers to speed them up. The computations are independent, so it should be relatively straightforward to parallelize them.

from multiprocessing import Pool

def computeInfoPooled(G,num_list):
    pool=Pool(processes=4)

    def f(k):
        curr_stat={}
        curr_stat[k]=slow_function(k,G)
        return curr_stat

    result = pool.map(f,num_list)

    return result

Now, I ran the following in the console:

computed_result=computeInfoPooled(G)

I would expect this code to create 4 processes and call f with every item (a number) of num_list in a separate process. If num_list contains more than 4 numbers (in my case about 300), it would run 4 at a time and queue the rest until one of the pooled workers is done.

What happened when I ran my code is that many python.exe processes were being spawned (or forked, I'm not sure which). It seemed to be creating infinitely many processes, so I had to unplug my machine.

Any ideas what I'm doing wrong and how I could fix it?

Upvotes: 2

Views: 241

Answers (2)

abarnert

Reputation: 366133

This is explained in the docs, in the Windows section of the Programming guidelines.

Depending on your platform, each worker process may have to start a brand-new interpreter and import your module just to find the f function it is supposed to call. (On Windows, it always has to do so.) When it imports your module, all top-level code runs, which includes the line computed_result=computeInfoPooled(G); that creates a whole new pool of 4 processes, each of which imports the module again, and so on.

You get around this the same way you deal with any other case where you want the same file to be both importable as a module and runnable as a script:

from multiprocessing import Pool

def computeInfoPooled(G,num_list):
    pool=Pool(processes=4)

    def f(k):
        curr_stat={}
        curr_stat[k]=slow_function(k,G)
        return curr_stat

    result = pool.map(f,num_list)

    return result

if __name__ == '__main__':
    computed_result=computeInfoPooled(G)
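To see why the guard matters, here is a minimal sketch (demo.py and square are hypothetical names, not part of the question): a print at the top level of the script makes the re-import visible, because on Windows every worker re-imports the module and prints its own PID before it can run the mapped function.

# demo.py -- hypothetical sketch of top-level code re-running in the workers
from multiprocessing import Pool
import os

# Top-level code: runs once in the parent, and again in every child
# process that imports this module (on Windows, that is every worker).
print("importing module in process", os.getpid())

def square(x):
    return x * x

if __name__ == '__main__':
    # Guarded code: runs only in the parent, so the workers never try
    # to create pools of their own.
    pool = Pool(processes=2)
    print(pool.map(square, range(4)))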

From your edit and your comments, you seem to be expecting that calling computeInfoPooled(G) from the interactive interpreter will avoid the problem. The same linked docs section explains in detail why that doesn't work, and a big note near the top of the Introduction says so directly:

Functionality within this package requires that the main module be importable by the children. This is covered in Programming guidelines however it is worth pointing out here. This means that some examples, such as the multiprocessing.Pool examples will not work in the interactive interpreter.

If you want to understand why this is true, you need to read the linked docs (and you will also need to understand a little about how import, pickle, and multiprocessing all work).
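A minimal sketch of the pickle part (outer and inner are hypothetical names): pool.map pickles the function it sends to the workers, functions pickle by their module-level name, and so a function defined inside another function cannot be sent at all.

import pickle

def outer():
    def inner():
        pass
    return inner

pickle.dumps(outer)      # fine: a module-level function pickles by its name
# pickle.dumps(outer())  # fails: the nested function has no module-level name to pickle by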

Upvotes: 0

unutbu

Reputation: 880937

On Windows you need

if __name__ == '__main__':
    computed_result = computeInfoPooled(G)

to make your script importable without starting a fork bomb. (Read the section entitled "Safe importing of main module" in the docs.)

Also note that on Windows you may not be able to use the multiprocessing module from the interactive interpreter. See the warning near the top of the docs:

Functionality within this package requires that the main module be importable by the children. This is covered in Programming guidelines however it is worth pointing out here. This means that some examples, such as the multiprocessing.Pool examples will not work in the interactive interpreter. (my emphasis.)

Instead, save the script to a file, e.g. script.py and run it from the command line:

python script.py

In addition, the function you pass to pool.map must be picklable. f needs to be defined at the module level (not inside computeInfoPooled) to be picklable:

from multiprocessing import Pool

# f looks up G as a module-level global, so the workers need it defined at import time
def f(k):
    curr_stat = slow_function(k, G)
    return k, curr_stat


def computeInfoPooled(G, num_list):
    pool = Pool(processes=4)
    result = pool.map(f, num_list)
    return dict(result)

if __name__ == '__main__':
    computed_result = computeInfoPooled(G)

By the way, if f returns a dict, then pool.map(f, ...) will return a list of dicts. I'm not sure that is what you'd want, especially since each dict would only have one key-value pair.

Instead, if you let f return a (key, value) tuple, then pool.map(f, ...) will return a list of tuples, which you could then turn into a dict with dict(result).
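For instance, with purely illustrative values, the conversion looks like this:

# Hypothetical output of pool.map(f, [1, 2, 3]) if f returns (k, slow_function(k, G))
result = [(1, 0.25), (2, 0.5), (3, 0.75)]  # illustrative values only
computed = dict(result)                    # {1: 0.25, 2: 0.5, 3: 0.75}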

Upvotes: 1
