Reputation: 31
In the app I'm developing I use a multiprocessing.BaseManager to do some heavy and complex computations in parallel with the main process. I use a Manager and not a Pool because these computations are implemented as a class and only need to be performed once in a while.
Each time, I create a new instance of the computing class in the manager, call its methods, get the results back, then delete the instance and call gc.collect() in the manager.
Here's some pseudo-code to demonstrate the situation:
import gc
from multiprocessing.managers import BaseManager

class MyComputer(object):
    def compute(self, args):
        # several steps of computations; huge_list is the big result
        return huge_list

class MyManager(BaseManager): pass

MyManager.register('MyComputer', MyComputer)
MyManager.register('gc_collect', gc.collect)
if __name__ == '__main__':
    manager = MyManager()
    manager.start()

    # obtain args_list from the configuration file
    many_results = []
    for args in args_list:
        comp = manager.MyComputer()
        many_results.append(comp.compute(args))
        del comp
        manager.gc_collect()

    # do something with many_results
The result of a computation is big (200 MB to 600 MB). And the problem is: according to top, the resident memory used by the manager process grows significantly (by 50 MB to 1 GB) after each computation. It grows much faster if a single comp object is used for all computations, or if manager.gc_collect() is not called. So I guess the object is indeed deleted and the garbage collector does run, yet something is still left behind.
Here's a plot of resident memory used by the Manager process during five rounds of computations: https://i.sstatic.net/38tdo.png
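The numbers for such a plot can be sampled with something like the following sketch; it assumes the third-party psutil package and peeks at the manager's private _process attribute to get the server's pid, so treat it as a rough illustration rather than a recommended API:

import psutil  # third-party, assumed to be installed

def manager_rss_mb(manager):
    # BaseManager keeps its server process in the private _process attribute
    pid = manager._process.pid
    return psutil.Process(pid).memory_info().rss / (1024.0 * 1024.0)

# inside the loop from the code above:
#     print('manager RSS after this round: %.0f MB' % manager_rss_mb(manager))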
My questions are: why does the manager process keep so much memory after the computations are done, and how can I get it back?
Upvotes: 1
Views: 1921
Reputation: 31
After more than a week of research, I'm answering my own questions:
Another important conclusion of the investigation:
Notice the huge memory spikes in the plot (https://i.sstatic.net/38tdo.png). They are much larger than the size of any result (~250 MB) produced. This, it turned out, is because the results are pickled and unpickled as they pass between the processes. Pickling is a very expensive operation, and its memory usage grows non-linearly with the size of the object being pickled: (un)pickling an object of ~10 MB uses ~12-13 MB, but (un)pickling ~250 MB uses 800-1000 MB! So if you have to pickle a big object (which includes any usage of Pipes, Queues, Connections, shelves, etc.), you need to restructure that step somehow.
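The effect can be reproduced with a minimal standalone sketch, using only the standard library on a Unix system; resource.getrusage reports the process's peak RSS, and the list size below is arbitrary:

import pickle
import resource

def peak_rss_mb():
    # ru_maxrss is reported in kilobytes on Linux (bytes on macOS)
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0

if __name__ == '__main__':
    big = [float(i) for i in range(5000000)]  # roughly 150-200 MB of Python floats
    print('peak RSS with the list built: %.0f MB' % peak_rss_mb())

    blob = pickle.dumps(big, pickle.HIGHEST_PROTOCOL)
    print('pickled size: %.0f MB' % (len(blob) / 1024.0 / 1024.0))
    print('peak RSS after pickling: %.0f MB' % peak_rss_mb())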
Upvotes: 1
Reputation: 35109
It's hard to guess what the problem is, because memory leaks are always hard to find. I would recommend installing memory_profiler if you don't have it already. It can help you find the memory problem very easily.
Just an example of how to use it:
@profile
def foo():
    f = open('CC_2014.csv', 'rb')
    lines_f = f.readlines()*10000
    f.close()
    lines_f = None

foo()
As you can see, I added the @profile decorator to the function I suspect has a memory problem.
Then run your script like this:
python -m memory_profiler test.py
And the result is:
Line #    Mem usage     Increment   Line Contents
================================================
     1    9.316 MiB     0.000 MiB   @profile
     2                              def foo():
     3    9.316 MiB     0.000 MiB       f = open('CC_2014.csv', 'rb')
     4  185.215 MiB   175.898 MiB       lines_f = f.readlines()*10000
     5  185.211 MiB    -0.004 MiB       f.close()
     6    9.656 MiB  -175.555 MiB       lines_f = None
From this output you can easily see which lines eat up a lot of memory.
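In your case compute() runs inside the manager's server process, so it may be easier to import the decorator explicitly instead of relying on python -m memory_profiler. A sketch, assuming memory_profiler is installed for the interpreter the manager uses:

# the decorator prints line-by-line memory stats each time compute() returns
from memory_profiler import profile

class MyComputer(object):
    @profile
    def compute(self, args):
        # several steps of computations (as in the question)
        return huge_list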
Upvotes: 0