Reputation: 90
I am running a function on multiple processes, which takes a dict of large pandas dataframes as input. When starting the processes, the dict is copied to each process, but as far as I understand, the dict only contains references to the dataframes, so the dataframes themselves are not copied to each process. Is this correct, or does each process get a deep copy of the dict?
import numpy as np
from multiprocessing import Pool, Process, Manager

def process_dataframe(df_dict, task_queue, result_queue):
    while True:
        try:
            test_value = task_queue.get(block=False)
        except:
            break
        else:
            result = {df_name: df[df == test_value].sum() for df_name, df in df_dict.items()}
            result_queue.put(result)

if __name__ == '__main__':
    manager = Manager()
    task_queue = manager.Queue()
    result_queue = manager.Queue()

    df_dict = {'df_1': some_large_df1, 'df_2': some_large_df2, ...}

    test_values = np.random.rand(1000)
    for val in test_values:
        task_queue.put(val)

    with Pool(processes=4) as pool:
        processes = []
        for _ in range(4):
            # Is df_dict copied shallow or deep to each process?
            p = pool.Process(target=process_dataframe, args=(df_dict, task_queue, result_queue))
            processes.append(p)
            p.start()
        for p in processes:
            p.join()

    results = [result_queue.get(block=False) for _ in range(result_queue.qsize())]
Upvotes: 1
Views: 807
Reputation: 7887
TLDR: Interestingly enough, it does pass a copy, but not in the normal way. On systems that create child processes with fork and implement copy-on-write (COW), such as Linux, the child and parent processes share the same memory until one of them changes an object; only then is memory allocated for the changed object. (Windows has no fork: there, multiprocessing spawns fresh processes and pickles the arguments, so each child gets its own copy up front.)
I am a firm believer that it is better to see something in action than to just be told. That said, let's get into it.
I pulled some example multiprocessing code from online for this. The sample code fits the bill for answering this question, but it does not match the code from your question.
All of the following code is one script but I am going to break it apart to explain each part.
First, let's create a dictionary. I will use it instead of a DataFrame since the two behave similarly for our purposes, and I don't need to install a package to use it.
Note the id() calls: id() returns the unique identity of an object.
# importing the multiprocessing module
import multiprocessing
import sys # So we can see the memory we are using
myDict = dict()
print("My dict ID is:", id(myDict))
myList = [0 for _ in range(10000)] # Just a massive list of all 0s
print('My list size in bytes:', sys.getsizeof(myList))
myDict[0] = myList
print("My dict size with the list:", sys.getsizeof(myDict))
print("My dict ID is still:", id(myDict))
print("But if I copied my dic it would be:", id(myDict.copy()))
For me this outputted:
My dict ID is: 139687265270016
My list size in bytes: 87624
My dict size with the list: 240
My dict ID is still: 139687265270016
But if I copied my dic it would be: 139687339197496
Cool, so we see the id changes if we copy the object, and we see that the dictionary is just holding a pointer to the list (which is why the dict is so much smaller in memory).
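As a quick sanity check (a tiny sketch that just continues the script above, reusing myDict and myList), you can confirm that the dict entry is the very same list object and not a copy:
# myDict[0] and myList are literally the same object, not copies of each other
print(myDict[0] is myList)              # True
print(id(myDict[0]) == id(myList))      # True
# sys.getsizeof(myDict) only counts the dict's own table of references,
# not the ~87 KB list it points to
print(sys.getsizeof(myDict), sys.getsizeof(myList))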
Now let's look at whether a Process copies the dictionary.
def method1(var):
    print("method1 dict id is:", str(id(var)))

def method2(var):
    print("method2 dict id is:", str(id(var)))

if __name__ == "__main__":
    # creating processes
    p1 = multiprocessing.Process(target=method2, args=(myDict, ))
    p2 = multiprocessing.Process(target=method1, args=(myDict, ))

    # starting process 1
    p1.start()
    # starting process 2
    p2.start()

    # wait until process 1 is finished
    p1.join()
    # wait until process 2 is finished
    p2.join()

    # both processes finished
    print("Done!")
Here I am passing myDict as an arg to both my subprocess functions. This is what I get as an output:
method2 dict id is: 139687265270016
method1 dict id is: 139687265270016
Done!
Note: the id is the same as when we defined the dictionary earlier in the code. If the id never changes, then we are using the same object in all instances, so in theory a change made inside a Process should change the main object. But that doesn't happen like we would expect.
For example, let's change our method1:
def method1(var):
    print("method1 dict id is:", str(id(var)))
    var[0][0] = 1
    print("The first five elements of the list in the dict are:", var[0][:5])
AND add a couple of prints after our p2.join():
    p2.join()

    print("The first five elements of the list in the dict are:", myDict[0][:5])
    print("The first five elements of the list are:", myList[:5])
Running the whole script again now gives:
My dict ID is: 140077406931128
My list size in bytes: 87624
My dict size with the list: 240
My dict ID is still: 140077406931128
But if I copied my dic it would be: 140077455160376
method1 dict id is: 140077406931128
The first five elements of the list in the dict are: [1, 0, 0, 0, 0]
method2 dict id is: 140077406931128
The first five elements of the list in the dict are: [0, 0, 0, 0, 0]
The first five elements of the list are: [0, 0, 0, 0, 0]
Done!
Well, that's interesting... The ids are the same, and I can change the object in the function, but the dict doesn't change in the main process...
Well, after some further investigation I found this question/answer: https://stackoverflow.com/a/14750086/8150685
When creating a child process, the child inherits a copy of the parent process (including a copy of the ids!); however, provided the OS you are using implements COW (copy-on-write), the child and the parent will use the same memory unless either of them makes a change to the data, in which case memory will be allocated only for the object that was changed (in your case, it would make a copy of the DataFrame you changed).
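To tie this back to your question, whether the dict really gets copied depends on how the child processes are started. Below is a minimal, hypothetical sketch (the names worker and data are just for illustration, and it uses a plain dict of lists instead of DataFrames) contrasting the fork start method, where the child initially shares the parent's memory via COW, with spawn, where the arguments are pickled and the child gets a real copy up front. It assumes a platform where both start methods exist, e.g. Linux:
import multiprocessing as mp

def worker(d):
    # Print the identity of the dict as the child process sees it.
    print("child  id:", id(d))
    d["x"][0] = 99  # mutate it; the parent will not see this either way

if __name__ == "__main__":
    data = {"x": [0] * 5}
    print("parent id:", id(data))

    for method in ("fork", "spawn"):  # "fork" is not available on Windows
        ctx = mp.get_context(method)
        p = ctx.Process(target=worker, args=(data,))
        p.start()
        p.join()
        # fork: the child starts with the parent's memory mapping (COW),
        #       so the id printed in the child matches the parent's.
        # spawn: the dict is pickled and rebuilt in the child,
        #        so it is a separate copy from the very start.
        print(method, "-> parent still sees:", data["x"][:5])
Either way, the parent's data is untouched after the child mutates it; under fork that is copy-on-write kicking in at the moment of the write, while under spawn the child had its own pickled copy all along. So for your DataFrames on Linux, essentially nothing is duplicated up front and copies only appear for objects a worker actually modifies, whereas on Windows each worker receives its own pickled copy when it starts.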
Sorry for the long post, but I figured the workflow would be good to look at.
Hopefully this helped. Don't forget to upvote the question and answer at https://stackoverflow.com/a/14750086/8150685 if it helped you.
Upvotes: 1