navidoo

Reputation: 309

python multiprocessing peculiar memory management

I have a simple multiprocessing code:

from multiprocessing import Pool
import time 

def worker(data):
    # hold on to the received data long enough to observe memory usage
    time.sleep(20)

if __name__ == "__main__":
    numprocs = 10
    pool = Pool(numprocs)
    a = ['a' for i in xrange(1000000)]
    b = [a+[] for i in xrange(100)]

    data1 = [b+[] for i in range(numprocs)]                # every element is a copy of b
    data2 = [data1+[]] + ['1' for i in range(numprocs-1)]  # one element holds everything, the rest are tiny
    data3 = [['1'] for i in range(numprocs)]               # essentially empty, to measure forking overhead

    #data = data1
    #data = data2
    data = data3


    result = pool.map(worker,data)

b is just a large list. data is a list of length numprocs that gets passed to pool.map, so I expect numprocs processes to be forked and each element of data to be passed to one of them. I test 3 different data objects: data1 and data2 have practically the same total size, but with data1 each process gets a copy of the same large object, whereas with data2 one process gets all of data1 and the others get just a '1' (basically nothing). data3 is essentially empty, to measure the basic overhead cost of forking the processes.

Problem: The overall memory used is vastly different between data1 and data2. I measure the amount of additional memory used by the last line (pool.map()) and I get:

  1. data1: ~8GB
  2. data2: ~0.8GB
  3. data3: ~0GB

Shouldn't 1) and 2) be equal, since the total amount of data passed to the children is the same? What is going on?

I measure memory usage from the Active field of /proc/meminfo on a Linux machine (MemTotal - MemFree gives the same answer).
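
For reference, the value I look at can be read like this (a minimal helper sketch, shown here just for illustration); polling it while the workers are sleeping shows the additional memory used by pool.map():

def active_kb():
    # return the "Active" field of /proc/meminfo, reported in kB
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("Active:"):
                return int(line.split()[1])  # second column is the value in kB
    raise RuntimeError('"Active" field not found in /proc/meminfo')

print(active_kb())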

Upvotes: 1

Views: 325

Answers (1)

dano

Reputation: 94881

What you're seeing is a side effect of pool.map having to pickle each element in data in order to pass it to your child processes.

It turns out that data1 is much larger than data2 when you pickle it. Here's a small script that adds up the total pickled size of each element in each of your lists:

import pickle
import sys
from collections import OrderedDict

numprocs = 10
a = ['a' for i in range(1000000)]
b = [a+[] for i in range(100)]

data1 = [b+[] for i in range(numprocs)]
data2 = [data1+[]] + ['1' for i in range(numprocs-1)]
data3 = [['1'] for i in range(numprocs)]
sizes = OrderedDict()
for idx, data in enumerate((data1, data2, data3)):
    sizes['data{}'.format(idx+1)] = sum(sys.getsizeof(pickle.dumps(d))
                                            for d in data)

for k, v in sizes.items():
    print("{}: {}".format(k, v))

Output:

data1: 2002003470
data2: 200202593
data3: 480

As you can see, the total pickled size of data1 is about ten times the size of data2, which matches the order-of-magnitude difference in memory usage between the two.

The reason data2 pickles so much smaller is what Cilyan mentioned in the comments: you're actually making shallow copies when you create each data list:

>>> data2[0][2][0] is data2[0][0][0]
True
>>> data2[0][2][0] is data2[0][3][0]
True
>>> data2[0][2][0] is data2[0][4][0]
True

Now, data1 is making shallow copies, too:

>>> data1[0][0] is data1[1][0]
True
>>> data1[0][0] is data1[2][0]
True

The key difference is that with data2, all the shallow copies are inside of one top-level element of the iterable (data2[0]). So when pool.map pickles that element, it can de-duplicate all the shallow copies into one single sub-list, and just pickle that one sub-list, along with metadata that describes how it is nested inside data2. With data1, the shallow copies are spread across different top-level elements of data1, so pool.map pickles them individually, which means the de-duplication is lost. When you unpickle data1, the shallow copies are gone; each element ends up with its own, equal copy of the sub-list.
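
You can actually watch pickle do this de-duplication with pickletools; here's a small sketch with a toy sub-list (not one of your lists) to illustrate:

import pickle
import pickletools

sub = ['a', 'b']
# Both elements point at the same sub-list object. In the disassembly the
# sub-list is serialized only once, and the second occurrence shows up as a
# memo lookup (BINGET/GET) instead of being written out again.
pickletools.dis(pickle.dumps([sub, sub]))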

Compare these two examples, where data1 and data2 are identical to your lists:

>>> before_pickle1 = data1[0]
>>> before_pickle2 = data1[1]
>>> before_pickle1 is before_pickle2
False
>>> before_pickle1[0] is before_pickle2[0]
True  # The lists are the same before pickling
>>> after_pickle1 = pickle.loads(pickle.dumps(before_pickle1))
>>> after_pickle2 = pickle.loads(pickle.dumps(before_pickle2))
>>> after_pickle1[0] is after_pickle2[0]
False  # After pickling, the lists are not the same

>>> before_pickle1 = data2[0]
>>> before_pickle1[0][0] is before_pickle1[1][0]
True
>>> after_pickle1 = pickle.loads(pickle.dumps(before_pickle1))
>>> after_pickle1[0][0] is after_pickle1[1][0]
True  # The lists are still the same after pickling

This simulates the same pickling that pool.map does. With data1, all of the de-duplication you had because of the shallow copies is lost, which makes the memory usage much higher.
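
To put rough numbers on it, here is a small sketch (separate from your lists) that shows the size effect directly: pickling two references to the same sub-list in one call costs about one sub-list, while pickling them in separate calls, the way the elements of data1 get pickled, costs about two:

import pickle

shared = ['a' for i in range(1000000)]      # one large sub-list

together = [shared, shared]                 # both references inside one top-level object (like data2)
separate = [[shared], [shared]]             # references split across top-level elements (like data1)

one_call = len(pickle.dumps(together))                     # de-duplicated
many_calls = sum(len(pickle.dumps(x)) for x in separate)   # de-duplication lost

print(one_call, many_calls)   # many_calls is roughly twice one_call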

Upvotes: 4
