Reputation: 2102
I have a very interesting case in Python.
The main process creates N sub-processes. The sub-processes are instances of a class that inherits from multiprocessing.Process.
Now, when the number of sub-processes is 10, each sub-process consumes about 15 MB of resident memory. However, when I increase the number of sub-processes to 100, the resident memory consumption of each sub-process jumps to about 50 MB!
Can anybody explain this jump in memory, or suggest how to avoid it?
Here is the structure of the sub-process class:
class MySubProcess(multiprocessing.Process):
    def __init__(self, sub_process_number):
        multiprocessing.Process.__init__(self, target=self.go)
        self.m_sub_process_number = sub_process_number

    def go(self):
        self.m_config = global_config
        while True:
            ....
Thanks a lot!!!
Upvotes: 2
Views: 480
Reputation: 4343
When I attempt a simple example where each subprocess does nothing but time.sleep(), I don't see this behaviour, so I don't believe it's something intrinsic to the multiprocessing module.
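For reference, the kind of test I mean looks roughly like this (the class name and sleep duration are illustrative, not taken from the question):

import multiprocessing
import time

class SleepyProcess(multiprocessing.Process):
    def run(self):
        # Do nothing except sleep, so the child makes no allocations of its own.
        time.sleep(60)

if __name__ == '__main__':
    procs = [SleepyProcess() for _ in range(100)]
    for p in procs:
        p.start()
    # While the children sleep, inspect their resident sizes with ps or top.
    for p in procs:
        p.join()

With a test along these lines, the per-child resident size should not depend on how many children you create.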
My best guess would be the memory duplication functionality of fork(), which multiprocessing is likely using under the hood. The semantics of forking a new process on Unix call for the entire memory space of the parent process to be duplicated into the child. So, let's say you're creating a list of these MySubProcess structures prior to starting any of them. This list will then be duplicated into the address space of each child process, so when you look at the resident size of each of those processes it will appear significantly larger (assuming your structures occupy a non-trivial amount of memory).
Also, any other memory you allocate before launching the child processes will be duplicated, but the list of instances was the main thing I could think of which would increase in size as you allocate more processes. Depending on your code, there may be other data structures which scale with the number of processes (e.g. work queues).
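To make that concrete, here is a sketch of the allocation pattern I have in mind - MySubProcess is the class from the question, and big_lookup_table is a purely hypothetical stand-in for any other sizeable parent-side data:

# Everything allocated here lives in the parent before any fork() happens,
# so every child's address space will contain a (copy-on-write) copy of it.
processes = [MySubProcess(i) for i in range(100)]   # grows with the process count
big_lookup_table = [0] * 10000000                   # hypothetical extra parent data

for p in processes:
    p.start()   # fork(): the child inherits the list and the table above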
If you del everything you don't need in the context of each child, you might find their size goes back down, but this depends on quite a complex interaction between the Python allocators and the system memory allocator, so it's by no means certain. Essentially, Python may keep freed memory around for re-use, and even if the Python interpreter doesn't, the system allocator may. In short, this is probably not worth the effort - see the end of my answer for more info.
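If you do want to experiment with it, the idea is simply to drop references at the top of the child's entry point - a sketch only, with a made-up attribute name, and with no guarantee the memory actually goes back to the OS:

def go(self):
    import gc
    # Hypothetical: drop any large attributes inherited from the parent that
    # this child doesn't need, then ask the garbage collector to run.
    del self.m_big_inherited_data   # illustrative attribute name
    gc.collect()                    # the freed memory may still be retained
    while True:
        ...                         # rest of the original loop body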
This isn't nearly so bad as it seems, however, because Linux (and other modern Unix variants) uses so-called copy-on-write semantics to make sure the behaviour of fork() isn't so hideously inefficient. Essentially, the child processes keep a reference to the same memory pages as the parent process - as long as neither process changes anything, the memory isn't actually duplicated, although if you sum the memory usage figures from ps or top for both processes it will be counted twice, because their per-process approach isn't smart enough to notice the sharing of pages. This isn't dissimilar to having multiple hard links to the same underlying file, if that's something you've ever encountered.
Once a process writes to a memory page, it is then copied (hence the name "copy on write") and will therefore use actual physical memory. The amount of extra memory required is pretty difficult to predict in this case because it involves mapping Python data structures all the way down through to physical memory pages. However, the principle itself is what's important.
You can test whether my theory is correct by using the free utility to display overall system memory usage and comparing the figures between the two cases - if I'm right, you'll see some increased memory in the 100-subprocess case, but not as much as examining the memory usage of each process would suggest. Don't forget to use the numbers from the second line (i.e. the -/+ buffers/cache line), because this will smooth out any changes in the filesystem cache between your two tests.
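If you'd rather script the comparison, the same figure can be read from /proc/meminfo on Linux - a rough sketch:

def used_excluding_cache_kb():
    # Memory in use excluding buffers/cache, i.e. the figure on free's
    # "-/+ buffers/cache" line, read directly from /proc/meminfo (Linux only).
    info = {}
    with open('/proc/meminfo') as f:
        for line in f:
            key, value = line.split(':', 1)
            info[key] = int(value.split()[0])   # values are reported in kB
    return info['MemTotal'] - info['MemFree'] - info['Buffers'] - info['Cached']

print(used_excluding_cache_kb(), 'kB in use (excluding buffers/cache)')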
Assuming this is right, your best bet is to try to start your child processes as early as possible, before the parent process has allocated much memory. Beyond your best efforts at this, however, you probably don't need to worry too much about it - even if pages are copied on write, pages the child process never accesses will be swapped out to disk as required and likely never swapped back in, so they won't cause much of a performance hit (unless your platform doesn't have any swap).
One final note - in practice there's probably little point in creating any more worker processes than there are cores on the machine, which is typically not more than around 8 or perhaps 16 unless you're on extremely specialised hardware. If you create too many processes then you're probably wasting more time scheduling them all than you're getting benefit - you can't get any more parallelisation than physical cores whatever you do (although hyperthreading complicates this slightly).
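multiprocessing can tell you how many cores the machine has, so a common pattern is to size the worker pool from that - a sketch, reusing the MySubProcess class from the question:

import multiprocessing

num_workers = multiprocessing.cpu_count()   # logical cores, including hyperthreads
procs = [MySubProcess(i) for i in range(num_workers)]
for p in procs:
    p.start()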
This other SO question might provide some more useful information.
Upvotes: 1