connesy

Reputation: 90

Does copying a dict to multiple processes also copy the objects in the dict?

I am running a function on multiple processes, which takes a dict of large pandas dataframes as input. When starting the processes, the dict is copied to each process, but as far as I understand, the dict only contains references to the dataframes, so the dataframes themselves are not copied to each process. Is this correct, or does each process get a deep copy of the dict?

import queue  # for the queue.Empty exception raised by a non-blocking get()

import numpy as np
from multiprocessing import Pool, Process, Manager

def process_dataframe(df_dict, task_queue, result_queue):
    while True:
        try:
            test_value = task_queue.get(block=False)
        except queue.Empty:
            # No tasks left, so this worker is done
            break
        else:
            result = {df_name: df[df==test_value].sum() for df_name, df in df_dict.items()}
            result_queue.put(result)



if __name__ == '__main__':
    manager = Manager()
    task_queue = manager.Queue()
    result_queue = manager.Queue()

    df_dict = {'df_1': some_large_df1, 'df_2': some_large_df2, ...}
    test_values = np.random.rand(1000)

    for val in test_values:
        task_queue.put(val)

    with Pool(processes=4) as pool:
        processes = []
        for _ in range(4):

            # Is df_dict copied shallow or deep to each process?
            p = pool.Process(target=process_dataframe, args=(df_dict, task_queue, result_queue))
            processes.append(p)
            p.start()

        for p in processes:
            p.join()

    results = [result_queue.get(block=False) for _ in range(result_queue.qsize())]

Upvotes: 1

Views: 807

Answers (1)

TLDR: Interestingly enough, it does pass a copy, but not in the normal way. On systems that start child processes with fork (the default on Linux), the child and parent share the same memory until one of them changes an object, at which point memory is allocated only for the data that was changed (copy-on-write). On Windows, processes are spawned rather than forked, so the arguments are pickled and genuinely copied.

I am a firm believer that it is better to see something in action than to just be told, so let's get into it.

I pulled some example multiprocessing code from online for this. The sample code fits the bill for answering this question, but it does not match the code from your question.

All of the following code is one script but I am going to break it apart to explain each part.


Start the Example:

First, let's create a dictionary. I will use this instead of a DataFrame since they act similarly, but I do not need to install a package to use it.

Note: The id() function returns the unique identity of an object (in CPython, this is the object's memory address).

# importing the multiprocessing module 
import multiprocessing
import sys # So we can see the memory we are using 

myDict = dict()
print("My dict ID is:", id(myDict))
myList = [0 for _ in range(10000)] # Just a massive list of all 0s
print('My list size in bytes:', sys.getsizeof(myList))
myDict[0] = myList
print("My dict size with the list:", sys.getsizeof(myDict))
print("My dict ID is still:", id(myDict))
print("But if I copied my dic it would be:", id(myDict.copy()))

For me this outputted:

My dict ID is: 139687265270016
My list size in bytes: 87624
My dict size with the list: 240
My dict ID is still: 139687265270016
But if I copied my dict it would be: 139687339197496

Cool, so we see that the id changes if we copy the object, and that the dictionary just holds a pointer to the list (which is why the dict is so much smaller than the list in memory).
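As a quick aside (this snippet is mine, not part of the original run), the shallow-vs-deep distinction the question asks about can be seen within a single process: dict.copy() duplicates only the outer dict, while copy.deepcopy() duplicates the inner objects too.

# Sketch: shallow vs. deep copy within one process
import copy

d = {'key': [0] * 5}

shallow = d.copy()       # new outer dict, same inner list
deep = copy.deepcopy(d)  # new outer dict AND a new inner list

print(d['key'] is shallow['key'])  # True: both dicts point at the same list
print(d['key'] is deep['key'])     # False: deepcopy duplicated the list

shallow['key'][0] = 99
print(d['key'][:3])     # [99, 0, 0] -> mutation visible through the original
print(deep['key'][:3])  # [0, 0, 0]  -> the deep copy is unaffected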

Now let's look into whether a Process copies a dictionary.

def method1(var): 
    print("method1 dict id is:", str(id(var)))

def method2(var): 
    print("method2 dict id is:", str(id(var))) 

if __name__ == "__main__": 
    # creating processes 
    p1 = multiprocessing.Process(target=method2, args=(myDict, )) 
    p2 = multiprocessing.Process(target=method1, args=(myDict, )) 

    # starting process 1 
    p1.start() 
    # starting process 2 
    p2.start() 

    # wait until process 1 is finished 
    p1.join() 
    # wait until process 2 is finished 
    p2.join() 

    # both processes finished 
    print("Done!") 

Here I am passing myDict as an arg to both of my child process functions. This is what I get as output:

method2 dict id is: 139687265270016
method1 dict id is: 139687265270016
Done!

Note: The id is the same as when we defined the dictionary earlier in the code.

What does this mean?

If the id never changes, then we are using the same object in all instances. So in theory, if I make a change in a Process, it should change the main object. But that doesn't happen the way we would expect.

For example, let's change our method1:

def method1(var): 
    print("method1 dict id is:", str(id(var)))
    var[0][0] = 1
    print("The first five elements of the list in the dict are:", var[0][:5])

And add a couple of prints after our p2.join():

p2.join()
print("The first five elements of the list in the dict are:", myDict[0][:5])
print("The first five elements of the list are:", myList[:5])

My dict ID is: 140077406931128
My list size in bytes: 87624
My dict size with the list: 240
My dict ID is still: 140077406931128
But if I copied my dict it would be: 140077455160376
method1 dict id is: 140077406931128
The first five elements of the list in the dict are: [1, 0, 0, 0, 0]
method2 dict id is: 140077406931128
The first five elements of the list in the dict are: [0, 0, 0, 0, 0]
The first five elements of the list are: [0, 0, 0, 0, 0]
Done!

Well, that's interesting... The ids are the same, and I can change the object inside the function, but the dict doesn't change in the main process...

After some further investigation, I found this question/answer: https://stackoverflow.com/a/14750086/8150685

When you create a child process, the child inherits a copy of the parent's process image, including a copy of the ids, which makes sense once you know that in CPython an id is just the object's (virtual) memory address. However, provided the OS you are using implements COW (copy-on-write), the child and parent will use the same memory unless either of them changes the data, in which case memory is allocated only for the variable that was changed (in your case, that would be a copy of the DataFrame you modified).
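To make that concrete, here is a minimal sketch of my own (it assumes the fork start method, so it is Unix-only): a forked child sees an object that was never passed as an argument or pickled, and its writes land in its own copy-on-write pages without touching the parent's.

# Sketch (not from the original post): copy-on-write under fork
import multiprocessing

# Module-level data created before the child is started. Under fork,
# the child inherits the parent's memory image, so it can see this
# dict without it ever being passed as an argument or pickled.
inherited = {'payload': [0, 1, 2, 3, 4]}

def worker():
    print("child sees:", inherited['payload'])
    # This write triggers copy-on-write: the touched pages are
    # duplicated for the child; the parent's pages are untouched.
    inherited['payload'][0] = 999
    print("child after write:", inherited['payload'])

if __name__ == "__main__":
    ctx = multiprocessing.get_context("fork")  # Linux/macOS only
    p = ctx.Process(target=worker)
    p.start()
    p.join()
    print("parent still sees:", inherited['payload'])  # unchanged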

Sorry for the long post, but I figured the workflow would be good to look at.

Hopefully this helped. Don't forget to upvote the question and answer at https://stackoverflow.com/a/14750086/8150685 if it helped you.

Upvotes: 1
