Reputation: 37
I have multiple big numpy arrays:
x1=np.zeros((4000,4000))
x2=np.zeros((4000,4000))
x3=np.zeros((4000,4000))
.
.
.
xn=np.zeros((4000,4000))
and I want to execute a function on these arrays in parallel. Because every array is independent of the others, I thought I could use shared_memory so that a subprocess does not have to pickle the data.
Is it possible to create one big "shared variable" which contains the 3 big numpy arrays?
Inside the subprocesses I would like to write directly into these arrays (without pickling them).
I think I would pass each subprocess an idx (0, 1, 2, ..., n) argument to reference the x1, x2, x3, ..., xn arrays?
Is this possible? I think a single array is not the problem, but multiprocessing with multiple arrays is a bit confusing to me.
Thank you.
Upvotes: 2
Views: 3058
Reputation: 59691
This is how you could do it using an array of shared memory.
import numpy as np
import ctypes
from multiprocessing.sharedctypes import RawArray
from multiprocessing.pool import Pool
def main():
    n = ...  # Number of arrays
    # Make arrays
    x1 = np.zeros((4000, 4000), dtype=np.float64)
    x2 = np.zeros((4000, 4000), dtype=np.float64)
    # ...
    xn = np.zeros((4000, 4000), dtype=np.float64)
    # Make big array of shared memory (ctype must match array type)
    array_mem = RawArray(ctypes.c_double, n * 4000 * 4000)
    arr = np.frombuffer(array_mem, dtype=np.float64).reshape(n, 4000, 4000)
    arr[0] = x1
    arr[1] = x2
    # ...
    arr[n - 1] = xn
    # Process array in a pool of processes
    with Pool(initializer=init_process, initargs=(array_mem, arr.shape)) as p:
        p.map(process_array, range(n))
    # The array has been processed
    # ...
    print(*arr[:, :2, :3], sep='\n')
    # [[0. 0. 0.]
    #  [0. 0. 0.]]
    # [[100. 100. 100.]
    #  [100. 100. 100.]]
    # [[200. 200. 200.]
    #  [200. 200. 200.]]
    # ...

# Global values for subprocesses
process_array_mem = None
process_array_shape = None

# Process initializer saves memory pointer and array shape
def init_process(array_mem, array_shape):
    global process_array_mem, process_array_shape
    process_array_mem = array_mem
    process_array_shape = array_shape

def process_array(array_idx):
    # Create array from shared memory
    arr = np.frombuffer(process_array_mem, dtype=np.float64).reshape(process_array_shape)
    # Pick the array for this process
    process_array = arr[array_idx]
    # Do the processing
    process_array += 100 * array_idx

if __name__ == '__main__':
    main()
In the code above I put n = ... to set the number of arrays to whatever value it has in your case, but if you change it to n = 3 and save the snippet as a file, you can run it and see the result. The initializer and global values part may be a bit confusing, but the point is that array_mem must be inherited by the subprocesses, which means I cannot pass it as another parameter to map, and I think this is the simplest way to use it.
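Since the question mentions the shared_memory module explicitly: below is a minimal sketch of the same idea using multiprocessing.shared_memory (assuming Python 3.8+, where that module exists). Here the block has a name that can be passed to the workers as a plain string, so nothing has to be inherited through an initializer; the worker function and the N/SHAPE names are only illustrative, not part of the answer above.
import numpy as np
from multiprocessing import shared_memory
from multiprocessing.pool import Pool

N = 3                     # number of arrays (small here just to keep the demo runnable)
SHAPE = (N, 4000, 4000)   # one big block holding all arrays

def worker(args):
    shm_name, idx = args
    # Attach to the existing block by name and view it as a numpy array
    shm = shared_memory.SharedMemory(name=shm_name)
    arr = np.ndarray(SHAPE, dtype=np.float64, buffer=shm.buf)
    arr[idx] += 100 * idx  # write directly into shared memory, no pickling of the data
    shm.close()            # detach from the block (does not free it)

def main():
    shm = shared_memory.SharedMemory(create=True, size=int(np.prod(SHAPE)) * 8)
    arr = np.ndarray(SHAPE, dtype=np.float64, buffer=shm.buf)
    arr[:] = 0.0           # fill the block (or copy your existing x1..xn into arr[0..n-1])
    with Pool() as p:
        p.map(worker, [(shm.name, i) for i in range(N)])
    print(arr[:, 0, 0])    # [  0. 100. 200.]
    shm.close()
    shm.unlink()           # free the block once you are done with it

if __name__ == '__main__':
    main()
The trade-off is that you have to unlink() the block yourself when you are done, otherwise it stays allocated after the program exits.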
Upvotes: 2