Reputation: 1964
I would like to share numpy arrays between multiple processes. There are working solutions here. However, they all pass the arrays to the child processes through inheritance, which does not work for me because I have to start a few worker processes beforehand and I don't know how many arrays I'm going to deal with later on. Is there any way to create such arrays after the processes have been started and pass them to the processes via queues?
Btw, for some reason I'm not able to use multiprocessing.Manager.
Upvotes: 6
Views: 2202
Reputation: 1235
You should use shared memory, which solves exactly this use case. You keep full memory read/write speed, and all processes can read from and write to the array in shared memory without incurring any serialization or transport cost.
Below is the example from the official Python docs:
>>> # In the first Python interactive shell
>>> import numpy as np
>>> a = np.array([1, 1, 2, 3, 5, 8]) # Start with an existing NumPy array
>>> from multiprocessing import shared_memory
>>> shm = shared_memory.SharedMemory(create=True, size=a.nbytes)
>>> # Now create a NumPy array backed by shared memory
>>> b = np.ndarray(a.shape, dtype=a.dtype, buffer=shm.buf)
>>> b[:] = a[:] # Copy the original data into shared memory
>>> b
array([1, 1, 2, 3, 5, 8])
>>> type(b)
<class 'numpy.ndarray'>
>>> type(a)
<class 'numpy.ndarray'>
>>> shm.name # We did not specify a name so one was chosen for us
'psm_21467_46075'
>>> # In either the same shell or a new Python shell on the same machine
>>> import numpy as np
>>> from multiprocessing import shared_memory
>>> # Attach to the existing shared memory block
>>> existing_shm = shared_memory.SharedMemory(name='psm_21467_46075')
>>> # Note that a.shape is (6,) and a.dtype is np.int64 in this example
>>> c = np.ndarray((6,), dtype=np.int64, buffer=existing_shm.buf)
>>> c
array([1, 1, 2, 3, 5, 8])
>>> c[-1] = 888
>>> c
array([ 1, 1, 2, 3, 5, 888])
>>> # Back in the first Python interactive shell, b reflects this change
>>> b
array([ 1, 1, 2, 3, 5, 888])
>>> # Clean up from within the second Python shell
>>> del c # Unnecessary; merely emphasizing the array is no longer used
>>> existing_shm.close()
>>> # Clean up from within the first Python shell
>>> del b # Unnecessary; merely emphasizing the array is no longer used
>>> shm.close()
>>> shm.unlink() # Free and release the shared memory block at the very end
For a real use case like yours, you would pass the name shm.name through a Pipe, Queue, or any other multiprocessing communication mechanism. Note that only this tiny string needs to be exchanged between processes; the actual data stays in the shared memory space.
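As a rough sketch of that pattern (the worker function, the sentinel value and the variable names below are mine, not part of the documentation example), a worker started ahead of time can receive the block's name, shape and dtype through a Queue and attach to memory created later by the parent:

import numpy as np
from multiprocessing import Process, Queue, shared_memory

def worker(queue):
    while True:
        msg = queue.get()
        if msg is None:  # sentinel: no more arrays to process
            break
        name, shape, dtype = msg
        shm = shared_memory.SharedMemory(name=name)           # attach, no copy
        arr = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
        arr[:] += 1                                            # change is visible to the parent
        del arr                                                # release the buffer before closing
        shm.close()                                            # detach; the parent unlinks at the end

if __name__ == "__main__":
    q = Queue()
    p = Process(target=worker, args=(q,))
    p.start()                                  # worker exists before any array does

    a = np.array([1, 1, 2, 3, 5, 8])
    shm = shared_memory.SharedMemory(create=True, size=a.nbytes)
    b = np.ndarray(a.shape, dtype=a.dtype, buffer=shm.buf)
    b[:] = a[:]

    q.put((shm.name, b.shape, b.dtype.str))    # only metadata crosses the queue
    q.put(None)
    p.join()

    print(b)                                   # array([2, 2, 3, 4, 6, 9])
    del b
    shm.close()
    shm.unlink()                               # free the block once everyone is done

The queue never carries the array payload itself, only the name/shape/dtype triple, so the communication cost is independent of the array size.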
Upvotes: 3
Reputation: 3550
Depending on your exact use case, using np.memmap for the arrays you want to transfer can be a good approach. The data lives on disk but is used like a standard array, and only the "header" metadata (filename, shape, dtype) needs to be pickled into the Queue, so it's very fast.
See https://docs.scipy.org/doc/numpy/reference/generated/numpy.memmap.html
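A rough sketch of that idea (the file name and the worker protocol are illustrative, not prescribed by numpy): the parent writes the array to a memory-mapped file, only the small (filename, shape, dtype) tuple goes through the Queue, and the worker re-opens the memmap so both sides see the same data:

import os
import numpy as np
from multiprocessing import Process, Queue

def worker(queue):
    filename, shape, dtype = queue.get()
    arr = np.memmap(filename, dtype=dtype, mode="r+", shape=shape)  # no data copied here
    arr[0] = -1.0                                                   # write goes back to the shared file
    arr.flush()

if __name__ == "__main__":
    q = Queue()
    p = Process(target=worker, args=(q,))
    p.start()

    fname = "shared_array.dat"                        # illustrative path
    m = np.memmap(fname, dtype="float64", mode="w+", shape=(1000,))
    m[:] = np.arange(1000, dtype="float64")
    m.flush()

    q.put((fname, m.shape, "float64"))                # only this small tuple is pickled
    p.join()

    print(m[0])                                       # -1.0, written by the worker
    del m                                             # release the mapping before removing the file
    os.remove(fname)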
Upvotes: 0