Reputation: 8820
I have read this post, Python multiprocessing: sharing a large read-only object between processes?, but I am still not sure how to proceed.
Here is my problem:
I am analysing an array of millions of strings using multiprocessing, and each string needs to be checked against a big dict of about 2 million (maybe more) keys. Its values are objects of a custom Python class called Bloomfilter (so they're not just simple ints, floats, or arrays), and their sizes vary from a few bytes to 1.5 GB. The analysis for each string is basically to check whether the string is in a certain number of bloomfilters in the dictionary; which bloomfilters are relevant depends on the string itself. The dictionary is a transformation of a 30 GB sqlite3 db. The motivation is to load the whole sqlite3 db into memory to speed up processing, but I haven't found a way to share the dict effectively. I have about 100 GB of memory in my system.
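To make the access pattern concrete, the per-string check is roughly the following (relevant_keys is a stand-in for my real routing logic, and I'm assuming membership testing on a Bloomfilter works via "in"):

def check_string(s, big_dict):
    # Figure out which bloomfilters matter for this string, then test membership.
    hits = []
    for key in relevant_keys(s):   # stand-in for the logic that maps a string to dict keys
        if s in big_dict[key]:     # assumes Bloomfilter supports membership testing
            hits.append(key)
    return hits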
Here is what I have tried:
The analysis for each string is CPU-bound, so I chose multiprocessing over multithreading. The key question is how to share the big dict among the processes without copying it. multiprocessing.Value and multiprocessing.Array cannot deal with complex objects like a dict. I have tried multiprocessing.Manager(), but because the dict is so big I get an IOError: bad message length error. I have also tried an in-memory database like Redis on localhost, but the bitarray that is used to construct a Bloomfilter after it is fetched is too big to fit either, which makes me think that passing big messages among processes is just too expensive (is it?).
My Question:
What is the right way to share such a dictionary among different processes (or threads, if there is a way to circumvent the GIL)? If I need to use a database, which one should I use? I need very fast reads, and the database should be able to store very big values. (Though I don't think a database would work, because passing around very big values won't work, right? Please correct me if I am wrong.)
Upvotes: 2
Views: 2604
Reputation: 8820
It turns out that both @Max and @Dunes are correct, but I don't need to call os.fork() directly or use a global variable. Some pseudo-code is shown below; as long as big_dict is not modified in the worker, there appears to be only one copy in memory. However, I am not sure whether this copy-on-write behaviour is universal across unix-like OSes. The OS I am running my code on is CentOS release 5.10 (Final).
from multiprocessing import Process, Lock

def worker(pid, big_dict, lock):
    # big_dict MUST NOT be modified in the worker, so that copy-on-write
    # keeps a single physical copy of it in memory.
    pass  # do some heavy work here

def main():
    big_dict = init_a_very_big_dict()
    NUM_CPUS = 24
    lock = Lock()
    procs = []
    for pid in range(NUM_CPUS):
        proc = Process(target=worker, args=(pid, big_dict, lock))
        proc.daemon = True
        procs.append(proc)
        proc.start()
    for proc in procs:
        proc.join()

if __name__ == "__main__":
    main()
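As a usage note, the real worker only ever reads from big_dict; schematically it looks something like this (strings_for_worker and report are placeholders for my actual input and output code, and check_string is the read-only lookup sketched in the question):

def worker(pid, big_dict, lock):
    # Only reads big_dict; writing to it would trigger copies of the touched pages.
    for s in strings_for_worker(pid):     # placeholder: this worker's share of the strings
        hits = check_string(s, big_dict)  # read-only membership tests against the bloomfilters
        with lock:                        # the lock guards the shared output, not big_dict
            report(pid, s, hits)          # placeholder for my real output code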
Upvotes: 2