Multiprocessing code does not work when trying to initialize dataframe columns

Question

I am trying to use the multiprocessing module in Python 3.6 to initialize each column of a dataframe on a separate CPU core, but my code doesn't work. Does anybody know what is wrong with it? I appreciate your help.

My laptop runs Windows 10 and has an 8th-generation Core i7 CPU. This is the code:

import time        
import pandas as pd
import numpy as np
import multiprocessing 
df=pd.DataFrame(index=range(10),columns=["A","B","C","D"])


def multiprocessing_func(col):

    for i in range(0,df.shape[0]):
        df.iloc[i,col]=np.random.rand()
    print("column "+str(col)+ " is completed" )


if __name__ == '__main__':
    starttime = time.time()
    processes = []
    for i in range(0,df.shape[1]):
        p = multiprocessing.Process(target=multiprocessing_func, args=(i,))
        processes.append(p)
        p.start()
    for process in processes:
        process.join()

    print('That took {} seconds'.format(time.time() - starttime))

Roland Smith · Accepted Answer

When you start a Process, it is basically a copy of the parent process. (I'm skipping over some details here, but they shouldn't matter for the explanation).

Unlike threads, processes don't share data. (Processes can use shared memory, but this is not automatic. To the best of my knowledge, the mechanisms in multiprocessing for sharing data cannot handle a dataframe.)

So what happens is that each of the worker processes is modifying its own copy of the dataframe, not the dataframe in the parent process.
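A small sketch of this effect may help. The names `df`, `fill` and `run_demo` are made up for illustration and are not taken from the question:

```python
import multiprocessing

import pandas as pd

# Module-level dataframe, as in the question.
df = pd.DataFrame({"A": [0, 0, 0]})


def fill():
    # Runs in the worker process and changes the worker's own copy of df.
    df["A"] = 1
    print("in worker:", df["A"].tolist())


def run_demo():
    p = multiprocessing.Process(target=fill)
    p.start()
    p.join()
    # The dataframe in the parent process has not been touched.
    return df["A"].tolist()


if __name__ == "__main__":
    print("in parent:", run_demo())
```

The worker reports a column of ones, while the parent still sees the original zeros.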

For this to work, you'd have to send the new data back to the parent process. You could do that by, e.g., returning it from the worker function and then putting the returned data into the original dataframe.
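Note that the return value of a `Process` target is discarded, so one way to get the data back is `multiprocessing.Pool`, which pickles each return value and sends it to the parent. The following is only a rough sketch: `make_column`, `build_dataframe`, `N_ROWS` and `COLUMNS` are placeholder names, and the random numbers stand in for whatever the real calculation is.

```python
import multiprocessing

import numpy as np
import pandas as pd

N_ROWS = 10
COLUMNS = ["A", "B", "C", "D"]


def make_column(col_name):
    # Stand-in for the real calculation of one column.
    # Each call creates its own generator so the workers do not
    # share (and repeat) the same random state.
    rng = np.random.default_rng()
    return col_name, rng.random(N_ROWS)


def build_dataframe():
    df = pd.DataFrame(index=range(N_ROWS), columns=COLUMNS, dtype=float)
    with multiprocessing.Pool(processes=len(COLUMNS)) as pool:
        # pool.map sends each return value back to the parent process.
        for col_name, values in pool.map(make_column, COLUMNS):
            df[col_name] = values  # assignment happens in the parent
    return df


if __name__ == "__main__":
    print(build_dataframe())
```

The important point is that the dataframe is only ever modified in the parent process; the workers just produce data.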

It only makes sense to use multiprocessing like this if generating the data takes significantly longer than launching a new worker process, sending the data back to the parent process and putting it into the dataframe. Since you are basically filling the columns with random data, I don't think that is the case here, so I don't see why you would use multiprocessing.

Edit: Based on your comment that it takes days to calculate each column, I would propose the following.

Use Process like you have been doing, but have each of the worker processes save the numbers it produces in a file whose name includes the value of i. Have the workers return a status code so you can determine whether they have succeeded or failed. In case of failure, also return some kind of index of the amount of data successfully completed, so you don't have to calculate that part again.

The file format should be simple and preferably human-readable, e.g. one number per line.

Wait for all processes to finish, read the files and fill the dataframe.
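A minimal sketch of that scheme, under a few assumptions of mine: the file name pattern `column_<i>.txt`, the helper names `column_path`, `worker` and `collect`, and the random numbers standing in for the real calculation are all hypothetical. The process exit code serves as the status code, and the number of lines already in a file serves as the index of completed work.

```python
import multiprocessing
import os

import numpy as np
import pandas as pd

N_ROWS = 10
COLUMNS = ["A", "B", "C", "D"]


def column_path(i, directory="."):
    # Hypothetical naming scheme: one file per column index.
    return os.path.join(directory, "column_{}.txt".format(i))


def worker(i, directory="."):
    path = column_path(i, directory)
    # Rows already on disk do not need to be calculated again.
    done = 0
    if os.path.exists(path):
        with open(path) as f:
            done = sum(1 for _ in f)
    rng = np.random.default_rng()
    with open(path, "a") as f:
        for _ in range(done, N_ROWS):
            value = float(rng.random())  # stand-in for the real calculation
            f.write("{!r}\n".format(value))  # one number per line
            f.flush()
    # An uncaught exception makes the process end with a nonzero exit code.


def collect(directory="."):
    df = pd.DataFrame(index=range(N_ROWS), columns=COLUMNS, dtype=float)
    for i, name in enumerate(COLUMNS):
        df[name] = np.loadtxt(column_path(i, directory), ndmin=1)
    return df


if __name__ == "__main__":
    processes = [
        multiprocessing.Process(target=worker, args=(i,))
        for i in range(len(COLUMNS))
    ]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
    failed = [i for i, p in enumerate(processes) if p.exitcode != 0]
    if failed:
        print("These columns failed, rerun to resume:", failed)
    else:
        print(collect())
```

Because each worker appends to its file and flushes after every value, a rerun after a crash picks up where the previous attempt stopped.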
