Python multiprocessing: can I reuse processes (already parallelized functions) with updated global variable?

Question

At first let me show you the current setup I have:

import multiprocessing.pool
from contextlib import closing
import os

def big_function(param):
   process(another_module.global_variable[param])


def dispatcher():
    # sharing read-only global variable taking benefit from Unix
    # which follows policy copy-on-update
    # https://stackoverflow.com/questions/19366259/
    another_module.global_variable = huge_list

    # send indices
    params = range(len(another_module.global_variable))

    with closing(multiprocessing.pool.Pool(processes=os.cpu_count())) as p:
        multiprocessing_result = list(p.imap_unordered(big_function, params))

    return multiprocessing_result

Here I use shared variable updated before creating process pool, which contains huge data, and that indeed gained me speedup, so it seem to be not pickled now. Also this variable belongs to the scope of an imported module (if it's important).

When I tried to create setup like this:

another_module.global_variable = []

p = multiprocessing.pool.Pool(processes=os.cpu_count())

def dispatcher():
    # sharing read-only global variable taking benefit from Unix
    # which follows policy copy-on-update
    # https://stackoverflow.com/questions/19366259/
    another_module_global_variable = huge_list

    # send indices
    params = range(len(another_module.global_variable))

    multiprocessing_result = list(p.imap_unordered(big_function, params))

    return multiprocessing_result

p "remembered" that global shared list was empty and refused to use new data when was called from inside the dispatcher.

Now here is the problem: processing ~600 data objects on 8 cores with the first setup above, my parallel computation runs 8 sec, while single-threaded it works 12 sec.

This is what I think: as long, as multiprocessing pickles data, and I need to re-create processes each time, I need to pickle function big_function(), so I lose time on that. The situation with data was partially solved using global variable (but I still need to recreate pool on each update of it).

What can I do with instances of big_function()(which depends on many other functions from other modules, numpy, etc)? Can I create os.cpu_count() of it's copies once and for all, and somehow feed new data into them and receive results, reusing workers?

Python multiprocessing: can I reuse processes (already parallelized functions) with updated global variable?

Answers (1)

Related Questions