Reputation: 1509
I am trying to analyse a large shelve with multiprocessing.Pool. Since it is opened in read-only mode it should be thread safe, but it seems that a large object is read first and then slowly dispatched through the pool. Can this be done more efficiently?
Here is a minimal example of what I'm doing. Assume that test_file.shelf
already exists and is large (14GB+). I can see this snippet hogging 20GB of RAM, even though only a small part of the shelf needs to be in memory at any one time (there are many more items than processors).
from multiprocessing import Pool
import shelve
def func(key_val):
    print(key_val)

with shelve.open('test_file.shelf', flag='r') as shelf, \
        Pool(4) as pool:
    list(pool.imap_unordered(func, iter(shelf.items())))
Upvotes: 0
Views: 454
Reputation: 4489
Shelves are inherently not fast to open: they work as a dictionary-like object backed by a database file, so opening one takes a while, especially for large shelves, because of how the backend works. To behave like a dictionary, each time you get a new item, that item is loaded into a separate dictionary in memory. lib reference
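As a rough illustration (reusing the file from the question), each lookup reads and unpickles one value at a time:

import shelve

# Each lookup reads one entry from the backing database and unpickles it,
# so values are materialised in memory one at a time as they are accessed.
with shelve.open('test_file.shelf', flag='r') as shelf:
    first_key = next(iter(shelf))   # keys come from the dbm backend
    value = shelf[first_key]        # only this value is read and unpickled here
    print(first_key, type(value))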
Also from the docs:
(imap) For very long iterables using a large value for chunksize can make the job complete much faster than using the default value of 1.
You are using the default chunksize of 1, which causes it to take a long time to get through your very large shelve file. The docs suggest using a larger chunksize, instead of sending one item at a time, to speed it up.
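For example, something along these lines (chunksize=64 is just an illustrative value; tune it to your item sizes):

from multiprocessing import Pool
import shelve

def func(key_val):
    print(key_val)

if __name__ == '__main__':
    with shelve.open('test_file.shelf', flag='r') as shelf, Pool(4) as pool:
        # Larger chunks mean fewer round trips between the parent and the workers.
        for _ in pool.imap_unordered(func, shelf.items(), chunksize=64):
            pass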
Last, just as a note, I'm not sure your assumption that this is safe for multiprocessing holds out of the box; it depends on the implementation. From the shelve docs:
The shelve module does not support concurrent read/write access to shelved objects. (Multiple simultaneous read accesses are safe.) When a program has a shelf open for writing, no other program should have it open for reading or writing. Unix file locking can be used to solve this, but this differs across Unix versions and requires knowledge about the database implementation used.
Edit: As pointed out by juanpa.arrivillaga, this answer describes what is happening in the backend: your entire iterable may be consumed up front, which causes the large memory usage.
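Given that, one way to keep memory bounded is to send only the (small) keys through the pool and let each worker open the shelf read-only and fetch its own values. This is just a sketch of that idea, not what you have to do; it assumes the underlying dbm backend tolerates several read-only openers:

from multiprocessing import Pool
import shelve

SHELF_PATH = 'test_file.shelf'   # illustrative, matches the question
_worker_shelf = None

def _init_worker():
    # Each worker process opens its own read-only handle once.
    global _worker_shelf
    _worker_shelf = shelve.open(SHELF_PATH, flag='r')

def func(key):
    # Only the key travels through the pool; the value is read in the worker.
    value = _worker_shelf[key]
    print(key, type(value))

if __name__ == '__main__':
    with shelve.open(SHELF_PATH, flag='r') as shelf, \
            Pool(4, initializer=_init_worker) as pool:
        # Even if the key iterable is consumed eagerly, keys are cheap to hold.
        for _ in pool.imap_unordered(func, shelf.keys(), chunksize=64):
            pass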
Upvotes: 1