zyxue

Reputation: 8820

How to share a very big dictionary among processes in Python

I have read this post, Python multiprocessing: sharing a large read-only object between processes?, but I am still not sure how to proceed.

Here is my problem:

I am analysing an array of millions of strings using multiprocessing, and each string needs to be checked against a big dict with about 2 million (maybe more) keys. The dict's values are objects of a custom Python class called Bloomfilter (so they are not just simple ints, floats, or arrays), and their sizes vary from a few bytes to 1.5 GB. The analysis of each string is basically a check of whether it is in a certain number of Bloomfilters in the dictionary; which Bloomfilters are relevant depends on the string itself. The dictionary is a transformation of a 30 GB sqlite3 database. The motivation is to load the whole sqlite3 database into memory to speed up processing, but I haven't found a way to share the dict effectively. I have about 100 GB of memory in my system.
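
To make the workload concrete, here is a minimal sketch of the per-string check. The key scheme (s[:2]) and the membership test via "in" are stand-ins for the real logic, which I have not spelled out here:

def relevant_keys(s):
    # hypothetical dispatch: pick which Bloomfilter keys apply to this string
    return [s[:2]]

def check_string(s, big_dict):
    # test the string against every relevant Bloomfilter it maps to
    return all(s in big_dict[k] for k in relevant_keys(s) if k in big_dict)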

Here is what I have tried:

The analysis of each string is CPU-bound, so I chose multiprocessing over multithreading. The key question is how to share the big dict among the processes without copying it. multiprocessing.Value and multiprocessing.Array cannot handle complex objects like a dict. I have tried multiprocessing.Manager(), but because the dict is so big I get an IOError: bad message length error. I have also tried an in-memory database such as Redis on localhost, but the bitarray that is used to reconstruct a Bloomfilter after being fetched is too big to fit in it either, which makes me think that passing big messages between processes is simply too expensive (is it?).
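
My Manager attempt was roughly like the sketch below (with a tiny stand-in dict and plain sets in place of the real Bloomfilters, and s[:2] as a placeholder key scheme). shared_dict is only a proxy, so every operation on it pickles data over a pipe, and very large messages are what seem to hit the bad message length limit:

from multiprocessing import Manager, Process

def worker(shared_dict, strings):
    for s in strings:
        bf = shared_dict.get(s[:2])  # each lookup ships a pickled copy of the value
        if bf is not None:
            print(s, s in bf)

def main():
    manager = Manager()
    # tiny stand-in for the real 30 GB dict of Bloomfilters
    shared_dict = manager.dict({'ab': {'abc', 'abd'}, 'cd': {'cde'}})
    procs = [Process(target=worker, args=(shared_dict, ['abc', 'cde']))
             for _ in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

if __name__ == '__main__':
    main()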

My Question:

What is the right way to share such a dictionary among different processes (or threads, if there is a way to circumvent the GIL)? If I need to use a database, which one should I use? I need very fast reads, and the database should be able to store very big values. (Though I don't think a database would work, because passing around very big values won't work, right? Please correct me if I am wrong.)

Upvotes: 2

Views: 2604

Answers (1)

zyxue

Reputation: 8820

It turns out that both @Max and @Dunes are correct, but I don't need to call os.fork() directly or use a global variable. Some pseudo-code is shown below; as long as big_dict isn't modified in the worker, there appears to be only one copy in memory. However, I am not sure whether this copy-on-write behaviour is universal across Unix-like OSes. The OS I am running my code on is CentOS release 5.10 (Final).

from multiprocessing import Process, Lock

def worker(pid, big_dict, lock):
    # big_dict MUST NOT be modified in the worker, so that
    # copy-on-write keeps a single copy in memory
    pass  # do some heavy work here instead

def main():
    big_dict = init_a_very_big_dict()

    NUM_CPUS = 24
    lock = Lock()
    procs = []
    for pid in range(NUM_CPUS):
        proc = Process(target=worker, args=(pid, big_dict, lock))
        proc.daemon = True
        procs.append(proc)
        proc.start()

    for proc in procs:
        proc.join()

if __name__ == '__main__':
    main()
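
The copy-on-write sharing above relies on the fork start method, which is the default on Linux but not on every platform (Windows uses spawn, and so does macOS on Python 3.8+). On Python 3.4+ you can request fork explicitly so the code fails early where it is not available; a minimal sketch, with a small dict standing in for the real one:

import multiprocessing as mp

def worker(pid, big_dict):
    # read-only access: with fork, the child starts with the parent's
    # memory pages shared copy-on-write
    print(pid, len(big_dict))

def main():
    big_dict = {'a': 1, 'b': 2}      # stand-in for the real 30 GB dict
    ctx = mp.get_context('fork')     # raises ValueError where fork is unavailable
    procs = [ctx.Process(target=worker, args=(pid, big_dict)) for pid in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

if __name__ == '__main__':
    main()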

Upvotes: 2
