Reputation: 309
I have a simple multiprocessing code:
from multiprocessing import Pool
import time

def worker(data):
    time.sleep(20)

if __name__ == "__main__":
    numprocs = 10
    pool = Pool(numprocs)

    a = ['a' for i in range(1000000)]
    b = [a+[] for i in range(100)]
    data1 = [b+[] for i in range(numprocs)]
    data2 = [data1+[]] + ['1' for i in range(numprocs-1)]
    data3 = [['1'] for i in range(numprocs)]

    #data = data1
    #data = data2
    data = data3

    result = pool.map(worker, data)
b is just a large list. data is a list of length numprocs that is passed to pool.map, so I expect numprocs processes to be forked, with each element of data passed to one of them.

I test 3 different data objects: data1 and data2 have practically the same size, but when using data1 each process gets a copy of the same object, whereas when using data2 one process gets all of data1 and the others get just a '1' (basically nothing). data3 is basically empty, to measure the basic overhead cost of forking the processes.
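To make the shapes concrete, here is a small sketch (same construction as above) that just prints how each list is laid out, so it's explicit where the large sub-lists live:

numprocs = 10
a = ['a' for i in range(1000000)]
b = [a+[] for i in range(100)]
data1 = [b+[] for i in range(numprocs)]
data2 = [data1+[]] + ['1' for i in range(numprocs-1)]
data3 = [['1'] for i in range(numprocs)]

for name, d in (('data1', data1), ('data2', data2), ('data3', data3)):
    # top-level length, and the length of each element that pool.map would pickle
    print(name, len(d), [len(x) for x in d])
# data1 10 [100, 100, ..., 100]  -> every element is a (shallow) copy of b
# data2 10 [10, 1, 1, ..., 1]    -> only data2[0] is big; the rest are just '1'
# data3 10 [1, 1, ..., 1]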
Problem:

The overall memory used is vastly different between data1 and data2. I measure the amount of additional memory used by the last line (pool.map()) and I get:

data1: ~8GB
data2: ~0.8GB
data3: ~0GB

Shouldn't (1) and (2) be equal, since the total amount of data passed to the children is the same? What is going on?

I measure memory usage from the Active field of /proc/meminfo on a Linux machine (MemTotal - MemFree gives the same answer).
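For reference, a minimal sketch of one way to take such a measurement (the Active field of /proc/meminfo is Linux-specific and reported in kB; the helper name meminfo_kb is just for illustration, and pool, worker and data are the objects from the script above):

def meminfo_kb(field):
    # Parse a single field (e.g. 'Active' or 'MemFree') out of /proc/meminfo; values are in kB.
    with open('/proc/meminfo') as f:
        for line in f:
            if line.startswith(field + ':'):
                return int(line.split()[1])
    raise KeyError(field)

before = meminfo_kb('Active')
result = pool.map(worker, data)      # the line whose memory cost is being measured
after = meminfo_kb('Active')
print('additional memory: ~{:.1f} GB'.format((after - before) / (1024.0 * 1024.0)))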
Upvotes: 1
Views: 325
Reputation: 94881
What you're seeing is a side effect of pool.map having to pickle each element in data in order to pass it to your child processes. It turns out that data1 is much larger than data2 when pickled. Here's a small script that adds up the total pickled size of each element in each of your lists:
import pickle
import sys
from collections import OrderedDict

numprocs = 10
a = ['a' for i in range(1000000)]
b = [a+[] for i in range(100)]
data1 = [b+[] for i in range(numprocs)]
data2 = [data1+[]] + ['1' for i in range(numprocs-1)]
data3 = [['1'] for i in range(numprocs)]

sizes = OrderedDict()
for idx, data in enumerate((data1, data2, data3)):
    sizes['data{}'.format(idx+1)] = sum(sys.getsizeof(pickle.dumps(d))
                                        for d in data)

for k, v in sizes.items():
    print("{}: {}".format(k, v))
Output:
data1: 2002003470
data2: 200202593
data3: 480
As you can see, the total pickled size of data1 is about ten times that of data2, which matches the order-of-magnitude difference in memory usage between the two.

data2 pickles so much smaller because of what Cilyan mentioned in the comments: you're actually making shallow copies when you create each data list:
>>> data2[0][2][0] is data2[0][0][0]
True
>>> data2[0][2][0] is data2[0][3][0]
True
>>> data2[0][2][0] is data2[0][4][0]
True
Now, data1 is making shallow copies, too:
>>> data1[0][0] is data1[1][0]
True
>>> data1[0][0] is data1[2][0]
True
The key difference is that with data2, all the shallow copies live inside one top-level element of the iterable (data2[0]). So when pool.map pickles that element, it can de-duplicate all the shallow copies into a single sub-list, and pickle just that one sub-list along with metadata that describes how it is nested inside data2. With data1, the shallow copies are spread across different top-level elements of data1, so pool.map pickles them individually and the de-duplication is lost. When you unpickle data1, the shallow copies are gone; each element ends up with a unique, but equal, copy of the sub-list.
Compare these two examples, where data1 and data2 are identical to your lists:
>>> before_pickle1 = data1[0]
>>> before_pickle2 = data1[1]
>>> before_pickle1 is before_pickle2
False
>>> before_pickle1[0] is before_pickle2[0]
True # The lists are the same before pickling
>>> after_pickle1 = pickle.loads(pickle.dumps(before_pickle1))
>>> after_pickle2 = pickle.loads(pickle.dumps(before_pickle2))
>>> after_pickle1[0] is after_pickle2[0]
False # After pickling, the lists are not the same
>>> before_pickle1 = data2[0]
>>> before_pickle1[0][0] is before_pickle1[1][0]
True
>>> after_pickle1 = pickle.loads(pickle.dumps(before_pickle1))
>>> after_pickle1[0][0] is after_pickle1[1][0]
True # The lists are still the same after pickling
This simulates the same pickling pool.map does. With data1, all of the de-duplication you got from the shallow copies is lost, which makes the memory usage much higher.
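You can see the same effect without multiprocessing at all: pickle's memo table de-duplicates repeated references only within a single dumps() call, so pickling data1 element by element (which is effectively what pool.map does) costs roughly ten times as much as pickling the whole list in one call. A rough sketch, reusing data1 from the script above (exact byte counts will vary):

import pickle
import sys

# One dumps() call per element: the shared copies of b are re-pickled for every element.
per_element = sum(sys.getsizeof(pickle.dumps(d)) for d in data1)

# A single dumps() call for the whole list: the shared sub-lists are pickled once,
# and later occurrences become small memo references.
whole_list = sys.getsizeof(pickle.dumps(data1))

print(per_element)   # ~2 GB, as in the output above
print(whole_list)    # ~0.2 GB, roughly the size of a single element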
Upvotes: 4