Christian Chapman

Reputation: 1025

Parallelizing modifications to a dictionary

I have a dictionary my_dict containing lists, and an iterable keys with a lot of keys which I would like to run a function on:

for key in keys:
    if key in my_dict:
        my_dict[key].append(my_fun(key, params))
    else:
        my_dict[key] = [my_fun(key, params)]    

my_fun is slow. How do I parallelize this loop?


Is it just:

from multiprocessing import Pool

def _process_key(key):
    if key in my_dict:
        my_dict[key].append(my_fun(key, params))
    else:
        my_dict[key] = [my_fun(key, params)]

if __name__ == '__main__':
    with Pool(5) as p:
        p.map(_process_key, keys)

Upvotes: 0

Views: 200

Answers (2)

9000

Reputation: 40894

Python is not good at CPU-bound multithreading because of the GIL (global interpreter lock), which lets only one thread execute Python bytecode at a time. If you want to speed up a CPU-bound computation, use multiprocessing.

I would split the keys of your dictionary into as many lists as you have cores available. Then I would pass these lists to subprocesses, along with the original dictionary, or a relevant part of it (if values are large object graphs).

The subprocesses would return partial results, which the main process would then merge into a single result.
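A minimal sketch of the split / compute / merge approach described above. The body of my_fun is a placeholder for the slow function from the question, and split_keys, process_chunk, merge_partials and parallel_update are hypothetical helper names, not part of any library. It assumes my_fun and params can be pickled and that my_fun does not need to read my_dict.

```python
from collections import defaultdict
from multiprocessing import Pool


def my_fun(key, params):
    # Placeholder for the slow function from the question (assumption).
    return key * params


def split_keys(keys, n_chunks):
    # Round-robin split so each worker gets a similar number of keys.
    keys = list(keys)
    return [keys[i::n_chunks] for i in range(n_chunks)]


def process_chunk(args):
    # Runs in a worker process: builds a partial result for one chunk.
    chunk, params = args
    partial = defaultdict(list)
    for key in chunk:
        partial[key].append(my_fun(key, params))
    return dict(partial)


def merge_partials(my_dict, partials):
    # Runs in the parent process, which owns my_dict.
    for partial in partials:
        for key, values in partial.items():
            my_dict.setdefault(key, []).extend(values)
    return my_dict


def parallel_update(my_dict, keys, params, n_workers=4):
    chunks = split_keys(keys, n_workers)
    with Pool(n_workers) as pool:
        partials = pool.map(process_chunk, [(c, params) for c in chunks])
    return merge_partials(my_dict, partials)


if __name__ == '__main__':
    print(parallel_update({1: [0]}, [1, 2, 3, 2], params=10))
```

Because the keys are dealt out round-robin, results for a key that appears several times may be appended in a different order than the sequential loop would produce. If that order matters, a contiguous split or the per-key pool.map approach is probably the safer choice.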

For I/O-bound computations, the same approach would work using threading, which could be faster because the data would be directly shared between threads.
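For the I/O-bound case, a rough sketch using concurrent.futures.ThreadPoolExecutor from the standard library. Here my_fun is a placeholder that only sleeps to imitate waiting on I/O, and threaded_update is a hypothetical helper name. The worker threads only compute values, and the main thread alone modifies my_dict, so no lock is needed in this arrangement.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def my_fun(key, params):
    # Placeholder for an I/O-bound call (assumption): sleep, then return.
    time.sleep(0.01)
    return f"{key}:{params}"


def threaded_update(my_dict, keys, params, n_threads=8):
    keys = list(keys)
    with ThreadPoolExecutor(max_workers=n_threads) as executor:
        # executor.map yields results in the same order as the input keys.
        results = executor.map(lambda k: my_fun(k, params), keys)
        for key, value in zip(keys, results):
            # Only the main thread touches my_dict.
            my_dict.setdefault(key, []).append(value)
    return my_dict


if __name__ == '__main__':
    print(threaded_update({'a': ['old']}, ['a', 'b', 'a'], 'p'))
```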

The above is pretty generic. I don't know how to best partition your key space for even load and maximum speedup; you have to experiment.

Upvotes: 0

tdelaney

Reputation: 77347

The dict lives in the parent's memory space, so you need to update it there; changes made inside a worker process are not visible to the parent. pool.map returns a list of whatever the worker function returns for each key, so just have the worker return its result in a useful form. collections.defaultdict is a helper that creates missing items for you, so you can write:

from multiprocessing import Pool
import collections

def _process_key(key): 
    return key, my_fun(key, params)

if __name__ == '__main__':
    with Pool(5) as p:
        my_dict = collections.defaultdict(list)
        for key, val in p.map(_process_key, keys):
            my_dict[key].append(val)

Upvotes: 2
