Reputation: 893
When I run the following code, I expect that once foo() has been executed, the memory used by it (basically to create m) would be released. However, that is not the case. To release this memory I need to restart the IPython console.
%%cython
# distutils: language = c++
import numpy as np
from libcpp.map cimport map as cpp_map

cdef foo():
    cdef:
        cpp_map[int,int] m
        int i
    for i in range(50000000):
        m[i] = i

foo()
It would be great if someone could tell me why this is the case and also how to release this memory without restarting the shell. Thanks in advance.
Upvotes: 5
Views: 3281
Reputation: 34326
The effects you are seeing are more or less implementation details of your memory allocator (probably glibc's default allocator). glibc's memory allocator works roughly as follows: small requests are served from so-called arenas, which grow on demand; when memory is freed it goes back to its arena, but it is not necessarily returned to the OS.
One can tweak when the memory from those arenas is released using mallopt, but normally an internal heuristic decides when/if the memory should be returned to the OS - which I must confess is kind of black magic to me.
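For illustration, here is a minimal sketch (glibc-only, and just my own illustration - the helper name is made up) of how mallopt could be used to make the trimming more aggressive; mallopt and the M_TRIM_THRESHOLD constant come from malloc.h:

%%cython
cdef extern from "malloc.h" nogil:
    int mallopt(int param, int value)
    int M_TRIM_THRESHOLD            # tuning parameter, a macro in glibc's malloc.h

def trim_aggressively():
    # threshold 0: hand free memory at the top of the heap back to the OS
    # as soon as possible (the glibc default is 128kB)
    mallopt(M_TRIM_THRESHOLD, 0)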
The problem with std::map (and the situation is similar for std::unordered_map) is that it doesn't consist of one big chunk of memory that could be returned to the OS immediately, but of a lot of small nodes (libstdc++ implements map as a red-black tree) - they all come from those arenas, and the heuristic decides not to return them to the OS.
As we are using glibc's allocator, one could use the non-standard function malloc_trim
to free the memory manually:
%%cython

cdef extern from "malloc.h" nogil:
    int malloc_trim(size_t pad)

def return_memory_to_OS():
    malloc_trim(0)
and now just call return_memory_to_OS() after every usage of foo.
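For example, directly in the IPython session:

foo()
return_memory_to_OS()   # hand the freed chunks back to the OS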
The above solution is quick & dirty, but it is not portable. What you really want is a custom allocator which releases the memory back to the OS as soon as it is no longer used. That is a lot of work - but luckily we already have such an allocator at hand: CPython's pymalloc - since Python 2.5 it returns memory to the OS (even if that sometimes causes trouble). However, we should also point out a big deficiency of pymalloc - it is not thread-safe, so it can only be used for code that holds the GIL!
Using the pymalloc allocator has not only the advantage of returning memory to the OS; because pymalloc is 8-byte aligned while glibc's allocator is 32-byte aligned, the resulting memory consumption will also be smaller (nodes of map[int,int] are 40 bytes, which cost only 40.5 bytes with pymalloc (together with overhead), while glibc needs not less than 64 bytes).
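If you want to check the 40-byte figure yourself, here is a small snippet of mine (libstdc++-specific: it peeks at the internal node type std::_Rb_tree_node, and the constant name is made up) using Cython's verbatim-C++ feature:

%%cython --cplus
cdef extern from *:
    """
    #include <cstddef>
    #include <map>       // pulls in libstdc++'s red-black tree machinery
    #include <utility>
    // size of the node allocated for every entry of std::map<int,int>
    const std::size_t MAP_INT_INT_NODE_SIZE = sizeof(std::_Rb_tree_node<std::pair<const int, int> >);
    """
    size_t MAP_INT_INT_NODE_SIZE

print(MAP_INT_INT_NODE_SIZE)   # 40 on a typical 64-bit libstdc++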
My implementation of the custom allocator follows Nicolai M. Josuttis' example and implements only the functionality that is really needed:
%%cython -c=-std=c++11 --cplus

cdef extern from *:
    """
    #include <cstddef>   // std::size_t
    #include <Python.h>  // pymalloc

    template <class T>
    class pymalloc_allocator {
      public:
        // type definitions
        typedef T           value_type;
        typedef T*          pointer;
        typedef std::size_t size_type;

        template <class U>
        pymalloc_allocator(const pymalloc_allocator<U>&) throw() {};
        pymalloc_allocator() throw() = default;
        pymalloc_allocator(const pymalloc_allocator&) throw() = default;
        ~pymalloc_allocator() throw() = default;

        // rebind allocator to type U
        template <class U>
        struct rebind {
            typedef pymalloc_allocator<U> other;
        };

        pointer allocate(size_type num, const void* = 0) {
            pointer ret = static_cast<pointer>(PyMem_Malloc(num*sizeof(value_type)));
            return ret;
        }

        void deallocate(pointer p, size_type num) {
            PyMem_Free(p);
        }

        // missing: destroy, construct, max_size, address
        //  -
    };

    // missing:
    //  bool operator== , bool operator!=

    #include <utility>
    typedef pymalloc_allocator<std::pair<int, int>> PairIntIntAlloc;

    // further helper (not in functional.pxd):
    #include <functional>
    typedef std::less<int> Less;
    """
    cdef cppclass PairIntIntAlloc:
        pass
    cdef cppclass Less:
        pass

from libcpp.map cimport map as cpp_map

def foo():
    cdef:
        cpp_map[int,int, Less, PairIntIntAlloc] m
        int i
    for i in range(50000000):
        m[i] = i
Now the lion's share of the used memory is returned to the OS once foo is done - on any operating system and with any memory allocator!
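If you want to see the effect, you can watch the resident set size of the process around a call to foo - this is just a quick check of mine with psutil, not part of the solution itself:

import os
import psutil

proc = psutil.Process(os.getpid())
print("RSS before:", proc.memory_info().rss // 2**20, "MiB")
foo()
print("RSS after: ", proc.memory_info().rss // 2**20, "MiB")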
If memory consumption is an issue, one could switch to unordered_map, which needs somewhat less memory. However, at the moment unordered_map.pxd doesn't offer access to all template parameters, so it has to be wrapped manually:
%%cython -c=-std=c++11 --cplus

cdef extern from *:
    """
    ....

    // further helper (not in functional.pxd):
    #include <functional>
    ...
    typedef std::hash<int> Hash;
    typedef std::equal_to<int> Equal_to;
    """
    ...
    cdef cppclass Hash:
        pass
    cdef cppclass Equal_to:
        pass

cdef extern from "<unordered_map>" namespace "std" nogil:
    cdef cppclass unordered_map[T, U, HASH=*, PRED=*, ALLOC=* ]:
        U& operator[](T&)

N = 5*10**8

def foo_unordered_pymalloc():
    cdef:
        unordered_map[int, int, Hash, Equal_to, PairIntIntAlloc] m
        int i
    for i in range(N):
        m[i] = i
Here are some benchmarks, which are obviously not complete but probably show the direction pretty well (run with N=3e7 instead of N=5e8):
                                    Time      PeakMemory
map_default                        40.1s      1416Mb
map_default+return_memory          41.8s
map_pymalloc                       12.8s      1200Mb
unordered_default                   9.8s      1190Mb
unordered_default+return_memory    10.9s
unordered_pymalloc                  5.5s       730Mb
The timings were done via the %timeit magic and the peak memory usage via /usr/bin/time -fpeak_used_memory:%M python script_xxx.py.
I'm somewhat surprised that pymalloc outperforms the glibc allocator by so much, and also that memory allocation seems to be the bottleneck for the usual map! Maybe this is the price glibc has to pay for supporting multi-threading.
unordered_map is faster and maybe needs less memory (OK, because of the rehashing the last part could be wrong).
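If the rehashing spikes are a concern, the manual wrapper above could be extended with std::unordered_map's reserve() member so the bucket table is sized only once - the following is just a sketch of mine (the function name is made up), reusing Hash, Equal_to and PairIntIntAlloc from the previous cell:

cdef extern from "<unordered_map>" namespace "std" nogil:
    cdef cppclass unordered_map[T, U, HASH=*, PRED=*, ALLOC=* ]:
        U& operator[](T&)
        void reserve(size_t)   # sizes the bucket table up front

def foo_unordered_reserved():
    cdef:
        unordered_map[int, int, Hash, Equal_to, PairIntIntAlloc] m
        int i
    m.reserve(N)               # all buckets allocated once, no rehashing in the loop
    for i in range(N):
        m[i] = i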
Upvotes: 5