high5

Reputation: 37

Multiprocessing multiple big numpy arrays as shared memory

I have multiple big numpy arrays:

x1 = np.zeros((4000, 4000))
x2 = np.zeros((4000, 4000))
x3 = np.zeros((4000, 4000))
...
xn = np.zeros((4000, 4000))

and I want to execute a function with these arrays in parallel. Because every array is independent of the others, I thought I could use shared_memory so that a subprocess does not have to pickle the data.

Is it possible to create one big "shared variable" which contains all of these big numpy arrays?

Inside the subprocesses I would like to write directly into these arrays (without pickling them).

I think I would pass each subprocess an idx argument (0, 1, 2, ..., n) to reference the x1, x2, x3, ..., xn arrays?

Is this possible? I think a single array is not the problem, but multiprocessing with multiple arrays is a bit confusing to me.

Thank you.

Upvotes: 2

Views: 3058

Answers (1)

javidcf

Reputation: 59691

This is how you could do it using an array of shared memory.

import numpy as np
import ctypes
from multiprocessing.sharedctypes import RawArray
from multiprocessing.pool import Pool

def main():
    n = ...  # Number of arrays
    # Make arrays
    x1 = np.zeros((4000, 4000), dtype=np.float64)
    x2 = np.zeros((4000, 4000), dtype=np.float64)
    # ...
    xn = np.zeros((4000, 4000), dtype=np.float64)
    # Make big array of shared memory (ctype must match array type)
    array_mem = RawArray(ctypes.c_double, n * 4000 * 4000)
    arr = np.frombuffer(array_mem, dtype=np.float64).reshape(n, 4000, 4000)
    arr[0] = x1
    arr[1] = x2
    # ...
    arr[n - 1] = xn
    # Process array in a pool of processes
    with Pool(initializer=init_process, initargs=(array_mem, arr.shape)) as p:
        p.map(process_array, range(n))
    # The array has been processed
    # ...
    print(*arr[:, :2, :3], sep='\n')
    # [[0. 0. 0.]
    #  [0. 0. 0.]]
    # [[100. 100. 100.]
    #  [100. 100. 100.]]
    # [[200. 200. 200.]
    #  [200. 200. 200.]]
    # ...

# Global values for subprocesses
process_array_mem = None
process_array_shape = None

# Process initializer saves memory pointer and array shape
def init_process(array_mem, array_shape):
    global process_array_mem, process_array_shape
    process_array_mem = array_mem
    process_array_shape = array_shape

def process_array(array_idx):
    # Create a NumPy view over the shared memory
    arr = np.frombuffer(process_array_mem, dtype=np.float64).reshape(process_array_shape)
    # Pick the array for this task (a distinct name, so it does not
    # shadow the process_array function itself)
    task_array = arr[array_idx]
    # Do the processing in place on the shared buffer
    task_array += 100 * array_idx

if __name__ == '__main__':
    main()

In the code above I put n = ... to set the number of arrays to whatever value it has in your case, but if you change it to n = 3 and save the snippet as a file, you can run it and see the result. The initializer and global values part may be a bit confusing, but the point is that array_mem must be inherited by the subprocesses, which means it cannot be passed as another parameter with map, and I think this is the simplest way to use it.
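As a side note, the shared_memory module that the question mentions is another option if you are on Python 3.8 or newer. There, workers attach to a shared block by name instead of inheriting it, so no initializer or globals are needed. The following is only a rough sketch of that alternative, not part of the code above; the block name big_arrays and the value N = 3 are placeholders:

import numpy as np
from multiprocessing import Pool
from multiprocessing import shared_memory

SHM_NAME = 'big_arrays'  # placeholder name for the shared block
N = 3                    # number of arrays, assumed for this sketch
SHAPE = (N, 4000, 4000)
DTYPE = np.float64

def worker(array_idx):
    # Attach to the existing block by name (the data is never pickled)
    shm = shared_memory.SharedMemory(name=SHM_NAME)
    try:
        arr = np.ndarray(SHAPE, dtype=DTYPE, buffer=shm.buf)
        # Each worker writes only to its own slice, so no lock is needed
        arr[array_idx] += 100 * array_idx
    finally:
        shm.close()  # detach from the block without destroying it

def main():
    nbytes = int(np.prod(SHAPE)) * np.dtype(DTYPE).itemsize
    shm = shared_memory.SharedMemory(name=SHM_NAME, create=True, size=nbytes)
    try:
        arr = np.ndarray(SHAPE, dtype=DTYPE, buffer=shm.buf)
        arr[:] = 0  # here you would copy x1, x2, ..., xn in
        with Pool() as p:
            p.map(worker, range(N))
        print(arr[:, 0, 0])  # [  0. 100. 200.]
    finally:
        shm.close()
        shm.unlink()  # free the shared block once done

if __name__ == '__main__':
    main()

Unlike the RawArray version, this does not rely on the shared memory being inherited at fork time, so it behaves the same under the spawn start method that Windows and macOS use by default.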

Upvotes: 2
