Reputation: 8820
I have read this post, Python multiprocessing: sharing a large read-only object between processes?, but I am still not sure how to proceed.
Here is my problem:
I am analysing an array of millions of strings using multiprocessing, and each string needs to be checked against a big dict of about 2 million (maybe more) keys. Its values are objects of a custom Python class called Bloomfilter (so they're not just simple ints, floats, or arrays), and their sizes vary from a few bytes to 1.5 GB. The analysis for each string is basically to check whether the string is in a certain number of bloomfilters in the dictionary; which bloomfilters are relevant depends on the string itself. The dictionary is a transformation of a 30 GB sqlite3 db. The motivation is to load the whole sqlite3 db into memory to speed up processing, but I haven't found a way to share the dict effectively. I have about 100 GB of memory in my system.
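To make the access pattern concrete, the per-string check is roughly the following (relevant_keys is a stand-in for my real routing logic, and I'm assuming membership testing on a Bloomfilter works via "in"):

def check_string(s, big_dict):
    # Figure out which bloomfilters matter for this string, then test membership.
    hits = []
    for key in relevant_keys(s):   # stand-in for the logic that maps a string to dict keys
        if s in big_dict[key]:     # assumes Bloomfilter supports membership testing
            hits.append(key)
    return hits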
Here is what I have tried:
The analysis for each string is CPU-bound, so I chose multiprocessing over multithreading. The key question is how to share the big dict among the processes without copying it. multiprocessing.Value and multiprocessing.Array cannot deal with complex objects like a dict. I have tried multiprocessing.Manager(), but because the dict is so big I get an IOError: bad message length error. I have also tried an in-memory database like Redis on localhost, but the bitarray that is used to construct a Bloomfilter after it is fetched is too big to fit either, which makes me think that passing big messages among processes is just too expensive (is it?).
My Question:
What is the right way to share such a dictionary among different processes (or threads, if there is a way to circumvent the GIL)? If I need to use a database, which one should I use? I need very fast reads, and the database should be able to store very big values. (Though I don't think a database would work, because passing around very big values won't work, right? Please correct me if I am wrong.)
Upvotes: 2
Views: 2604
Reputation: 8820
It turns out that both @Max and @Dunes are correct, but I don't need to call os.fork() directly or use a global variable. Some pseudo-code is shown below; as long as big_dict is not modified in the worker, there appears to be only one copy in memory. However, I am not sure whether this copy-on-write behaviour is universal across unix-like OSes. The OS I am running my code on is CentOS release 5.10 (Final).
from multiprocessing import Process, Lock

def worker(pid, big_dict, lock):
    # big_dict MUST NOT be modified in the worker, so that copy-on-write
    # keeps a single physical copy of it in memory.
    pass  # do some heavy work here

def main():
    big_dict = init_a_very_big_dict()
    NUM_CPUS = 24
    lock = Lock()
    procs = []
    for pid in range(NUM_CPUS):
        proc = Process(target=worker, args=(pid, big_dict, lock))
        proc.daemon = True
        procs.append(proc)
        proc.start()
    for proc in procs:
        proc.join()

if __name__ == "__main__":
    main()
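As a usage note, the real worker only ever reads from big_dict; schematically it looks something like this (strings_for_worker and report are placeholders for my actual input and output code, and check_string is the read-only lookup sketched in the question):

def worker(pid, big_dict, lock):
    # Only reads big_dict; writing to it would trigger copies of the touched pages.
    for s in strings_for_worker(pid):     # placeholder: this worker's share of the strings
        hits = check_string(s, big_dict)  # read-only membership tests against the bloomfilters
        with lock:                        # the lock guards the shared output, not big_dict
            report(pid, s, hits)          # placeholder for my real output code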
Upvotes: 2