MrBlueCharon

Reputation: 123

Is it possible and safe for Python multiprocessing to share the same input file and variables?

I've got a data file of considerable size (around 1.5 million entries). Each entry is read and used as the basis for a calculation, whose result is then added into a result data structure. The results of the individual calculations may overlap in this data structure (e.g. multiple calculations add to the same element of an array). Roughly, it looks like this:

import numpy as np

result = np.zeros(shape=(1000, 1000))

def calculateThings(data):
    # placeholders: a boolean condition, the actual math, and a neutral value
    return np.where(some_border_condition, function(data), no_mathematics)

with open("inputfile.txt", "r") as file:
    for line in file:
        n = calculateThings(float(line))
        result += n

A single calculation takes around 20-30 microseconds, which is okay. However, the sheer size of the input data makes this a very time-intensive process, so the idea was to parallelize the calculations: multiple processes split the file among themselves, read it in parallel, and all of them add the results of their calculations to the result array.

However, given the nature of multiple processes, I am not sure whether this is possible (especially the parallel reading) and how much I can trust the accuracy of the result array, as I'm afraid the processes will interfere with each other while reading from or writing to the variable. Does anyone have experience with this specific problem?

Upvotes: 1

Views: 1096

Answers (2)

Booboo

Reputation: 44108

Here is the problem: you are repeatedly adding some value to every element of the numpy array. If you had multiple processes doing this in parallel, you would have to do it under the control of some sort of lock, which means that all the updating of the numpy array would be performed serially anyway. So the only thing you can actually run in parallel is the multiple calls to calculateThings, which compute the addends. If that function is not sufficiently CPU-intensive, multiprocessing will not yield better performance.
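
To make that concrete, here is a minimal sketch (separate from the solution below; calculate_addend stands in for the question's calculateThings and the chunksize value is arbitrary) of the simplest alternative, where the workers only compute the addends and the main process does all of the accumulation serially. It works, but every task ships a full 1000x1000 array back to the parent, which is exactly the overhead the shared-memory version below avoids:

import numpy as np
from multiprocessing import Pool

def calculate_addend(data):
    # stand-in for the real per-entry computation from the question
    return np.full((1000, 1000), data)

if __name__ == '__main__':
    result = np.zeros((1000, 1000))
    with open("inputfile.txt", "r") as file:
        arguments = [float(line) for line in file]
    with Pool() as pool:
        # workers compute the addends in parallel; the += happens only here
        for n in pool.imap_unordered(calculate_addend, arguments, chunksize=256):
            result += n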

What you need is to be able to store your array in shared memory so that it is sharable across all processes. I am assuming that the array must be a float type.

import numpy as np
import ctypes

ARRAY_TYPE = ctypes.c_double
SHAPE = (1000, 1000)

def np_array_from_shared_array(shared_array):
    return np.frombuffer(shared_array.get_obj(), ARRAY_TYPE).reshape(SHAPE[0], SHAPE[1])

def init_pool_processes(shared_array):
    """
    Init each pool process.
    The numpy array is created from the passed shared array and a global
    variable is initialized with a reference to it.
    """
    global result, lock

    result = np_array_from_shared_array(shared_array)
    lock = shared_array.get_lock()

def calculateThings(data):
    global result

    # some_border_condition, function and no_mathematics are the
    # placeholders from the question
    n = np.where(some_border_condition, function(data), no_mathematics)
    # the lock guards the shared array while this process updates it
    with lock:
        result += n

# Required for Windows:
if __name__ == '__main__':
    from multiprocessing import Pool, Array

    # Create shared memory version of a numpy array:
    shared_arr = Array(ARRAY_TYPE, SHAPE[0] * SHAPE[1])
    result = np_array_from_shared_array(shared_arr)

    with open("inputfile.txt", "r") as file:
        arguments = [float(line) for line in file]

    pool = Pool(initializer=init_pool_processes, initargs=(shared_arr,))
    pool.map(calculateThings, arguments)
    pool.close()
    pool.join()
    #print(result)
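
One possible tweak, not shown above: with around 1.5 million tiny tasks, the per-task IPC overhead of pool.map can dominate, so passing an explicit chunksize batches the arguments sent to each worker (the value 10_000 here is only illustrative):

pool.map(calculateThings, arguments, chunksize=10_000)

And since the lock serializes every update of the shared array, another option worth considering is letting each worker accumulate into a private local array and fold it into the shared one less often, which reduces lock contention.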

Upvotes: 2

murari prasad

Reputation: 86

Correct me if I am wrong: processes do not share the same memory space, so each Python object exists separately in every process, whereas with threads, each thread shares the main memory space of the Python module.

So I think if you split your ndarray into 'n' parts (which are separate objects) and just use threads, it will be alright.
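
A minimal sketch of that suggestion (the identifiers are illustrative, and it assumes the per-entry work spends most of its time inside NumPy rather than in Python-level code, since otherwise the GIL limits how much the threads overlap): each thread accumulates into its own partial array, so no two threads ever write to the same object, and the partials are summed at the end.

import numpy as np
from concurrent.futures import ThreadPoolExecutor

def calculateThings(data):
    # placeholder for the real per-entry computation from the question
    return np.full((1000, 1000), data)

def worker(lines):
    # each thread gets its own private partial result
    partial = np.zeros((1000, 1000))
    for line in lines:
        partial += calculateThings(float(line))
    return partial

with open("inputfile.txt", "r") as file:
    lines = file.readlines()

n_threads = 4
chunks = [lines[i::n_threads] for i in range(n_threads)]  # round-robin split
with ThreadPoolExecutor(max_workers=n_threads) as executor:
    result = sum(executor.map(worker, chunks))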

Upvotes: 0
