Reputation: 1025
I have a dictionary my_dict containing lists, and an iterable keys with a lot of keys which I would like to run a function on:
for key in keys:
    if key in my_dict:
        my_dict[key].append(my_fun(key, params))
    else:
        my_dict[key] = [my_fun(key, params)]
my_fun is slow. How do I parallelize this loop?
Is it just:
import multiprocessing

def _process_key(key):
    if key in my_dict:
        my_dict[key].append(my_fun(key, params))
    else:
        my_dict[key] = [my_fun(key, params)]

if __name__ == '__main__':
    with multiprocessing.Pool(5) as p:
        p.map(_process_key, keys)
Upvotes: 0
Views: 200
Reputation: 40894
Python is not good at CPU-bound multithreading because of the GIL. If you want to speed up a CPU-bound computation, use multiprocessing.
I would split the keys of your dictionary into as many lists as you have cores available. Then I would pass these lists to subprocesses, along with the original dictionary, or a relevant part of it (if values are large object graphs).
The subprocesses would return partial results, which the main process would merge into a single result.
For I/O-bound computations, the same approach would work using threading, which could be faster because the data would be directly shared between threads.
The above is pretty generic. I don't know how to best partition your key space for even load and maximum speedup; you have to experiment.
Upvotes: 0
Reputation: 77347
The dict is in the parent memory space, so you need to update it there. pool.map iterates through whatever is returned by the worker function, so just have it return the results in a useful form. collections.defaultdict is a helper that creates items for you, so you can do:
import multiprocessing
import collections

def _process_key(key):
    # Do the slow work in the child process and hand the result back.
    return key, my_fun(key, params)

if __name__ == '__main__':
    with multiprocessing.Pool(5) as p:
        my_dict = collections.defaultdict(list)
        for key, val in p.map(_process_key, keys):
            my_dict[key].append(val)
Upvotes: 2