Reputation: 7676
I want to use multiprocessing.Pool
to load a large dataset. Here is the code I'm using:
from os import listdir
from os.path import join
import pickle
import multiprocessing as mp

db_path = '/path/to/db'  # placeholder; the actual path is not shown here

the_files = listdir(db_path)
fp_dict = {}

def loader(the_hash):
    global fp_dict
    the_file = join(db_path, the_hash)
    with open(the_file, 'rb') as source:
        fp_dict[the_hash] = pickle.load(source)
    print(len(fp_dict))

def parallel(the_func, the_args):
    global fp_dict
    pool = mp.Pool(mp.cpu_count())
    pool.map(the_func, the_args)
    print(len(fp_dict))

parallel(loader, the_files)
Interestingly, the length of fp_dict
changes while the code is running. However, as soon as the processes terminate, the length of fp_dict
is zero. Why? How can I modify a global variable using multiprocessing.Pool
?
Upvotes: 2
Views: 2575
Reputation: 76725
Because you are using multiprocessing.Pool
, your program runs in multiple processes. Each process has its own copy of the global variable; each process modifies only its own copy, and when the work is finished each process terminates. The master process never modified its copy of the global variable.
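As a quick illustration of that isolation (a sketch I'm adding, not code from the question), a worker can increment a global all it likes and the parent's copy stays untouched:

import multiprocessing as mp

counter = 0  # the parent process's copy

def bump(_):
    global counter
    counter += 1    # mutates this worker's private copy
    return counter  # report what this worker sees

if __name__ == '__main__':
    with mp.Pool(2) as pool:
        # Each worker counts in its own address space, so the exact
        # values depend on how tasks are distributed among workers.
        print(pool.map(bump, range(4)))
    print(counter)  # still 0 in the parent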
If you want to collect information about what happened inside each worker process, use the .map()
method and have each worker return a tuple of data: pool.map() gathers those return values into a list in the master process, which can then assemble a dictionary from them.
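For example, here is a minimal sketch of that pattern applied to the question's code (assuming, as above, that db_path points at the directory of pickled files): the worker returns a (hash, object) pair, and the master builds fp_dict from the list that pool.map() returns.

from os import listdir
from os.path import join
import pickle
import multiprocessing as mp

db_path = '/path/to/db'  # placeholder, as in the question

def loader(the_hash):
    # Worker: read one pickle and return its key and value
    # instead of mutating a global.
    with open(join(db_path, the_hash), 'rb') as source:
        return the_hash, pickle.load(source)

if __name__ == '__main__':
    the_files = listdir(db_path)
    with mp.Pool(mp.cpu_count()) as pool:
        # pool.map() returns the workers' (hash, object) tuples;
        # dict() assembles them in the master process.
        fp_dict = dict(pool.map(loader, the_files))
    print(len(fp_dict))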
Here's a YouTube tutorial that walks through using multiprocessing.Pool().map()
to collect the output from a worker function.
https://www.youtube.com/watch?v=_1ZwkCY9wxk
Here's another answer I wrote for StackOverflow, showing how to pass tuples so the worker function can take multiple arguments, and how to return a tuple with multiple values from the worker function. It even makes a dictionary out of the returned values.
https://stackoverflow.com/a/11025090/166949
Upvotes: 7