Manish Goel

Reputation: 893

Does Cython container not release memory?

When I run the following code, I expect that once foo() has been executed, the memory used by it (basically to create m) will be released. However, that is not the case. To release this memory I need to restart the IPython console.

%%cython
# distutils: language = c++

import numpy as np
from libcpp.map cimport map as cpp_map

cdef foo():
    cdef:
        cpp_map[int,int]    m
        int i
    for i in range(50000000):
        m[i] = i

foo()

It would be great if someone could tell me why this is the case, and also how to release this memory without restarting the shell. Thanks in advance.

Upvotes: 5

Views: 3281

Answers (1)

ead

Reputation: 34326

The effects you are seeing are more or less implementation details of your memory allocator (possibly glibc's default allocator). glibc's memory allocator works as follows:

  • requests for small memory sizes are satisfied from arenas, which grow (and whose number grows) as needed.
  • requests for large memory are taken directly from the OS, but are also returned directly to the OS as soon as they are freed.

One can tweak when the memory from those arenas is released using mallopt, but normally an internal heuristic decides when/if the memory should be returned to the OS - which, I must confess, is kind of black magic to me.
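For illustration, here is a minimal sketch of such tuning (M_TRIM_THRESHOLD and M_MMAP_THRESHOLD are glibc extensions, so this is not portable, and the function name and values are just examples):

%%cython
# glibc-only: tune when freed arena memory is trimmed back to the OS,
# and above which size requests bypass the arenas entirely (via mmap)
cdef extern from "malloc.h" nogil:
    int mallopt(int param, int value)
    int M_TRIM_THRESHOLD
    int M_MMAP_THRESHOLD

def tune_glibc_allocator(int trim_bytes, int mmap_bytes):
    # mallopt returns 1 on success and 0 on failure (see man mallopt)
    if not mallopt(M_TRIM_THRESHOLD, trim_bytes):
        raise RuntimeError("mallopt(M_TRIM_THRESHOLD) failed")
    if not mallopt(M_MMAP_THRESHOLD, mmap_bytes):
        raise RuntimeError("mallopt(M_MMAP_THRESHOLD) failed")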

The problem with std::map (and the situation is similar for std::unordered_map) is that it doesn't consist of one big chunk of memory, which would be returned to the OS immediately, but of a lot of small nodes (libstdc++ implements map as a red-black tree) - so they all come from those arenas, and the heuristic decides not to return them to the OS.

Because we are using glibc's allocator, one can use the non-standard function malloc_trim to release the memory manually:

%%cython

cdef extern from "malloc.h" nogil:
     int malloc_trim(size_t pad)

def return_memory_to_OS():
    malloc_trim(0)

and now just call return_memory_to_OS() after every usage of foo.
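For example, in the same session:

foo()                   # glibc keeps the freed map nodes in its arenas
return_memory_to_OS()   # trims the arenas, so the resident memory drops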


The above solution is quick & dirty, but it is not portable. What you want is a custom allocator which releases the memory back to the OS as soon as it is no longer used. That is a lot of work - but luckily we already have such an allocator at hand: CPython's pymalloc - since Python 2.5 it returns memory to the OS (even if that sometimes means trouble). However, we should also point out a big deficiency of pymalloc: it is not thread-safe, so it can only be used for code that holds the GIL!
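As a quick sketch of the pymalloc entry points the allocator below relies on (Cython exposes them via cpython.mem; the function name pymalloc_demo is just for illustration):

%%cython
from cpython.mem cimport PyMem_Malloc, PyMem_Free

def pymalloc_demo():
    # small allocations are served from pymalloc's pools; empty pools
    # are given back to the OS - but these calls need the GIL!
    cdef int* buf = <int*> PyMem_Malloc(10 * sizeof(int))
    if buf == NULL:
        raise MemoryError()
    try:
        buf[0] = 42
    finally:
        PyMem_Free(buf)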

Using the pymalloc allocator not only has the advantage of returning memory to the OS: because pymalloc is 8-byte-aligned while glibc's allocator is 32-byte-aligned, the resulting memory consumption will also be smaller (nodes of map[int,int] are 40 bytes, which cost only about 40.5 bytes with pymalloc (together with overhead), while glibc needs no less than 64 bytes).

My implementation of the custom allocator follows Nicolai M. Josuttis' example and implements only the functionality that is really needed:

%%cython -c=-std=c++11 --cplus

cdef extern from *:
    """
    #include <cstddef>   // std::size_t
    #include <Python.h>  // pymalloc

    template <class T>
    class pymalloc_allocator {
     public:
       // type definitions
       typedef T        value_type;
       typedef T*       pointer;
       typedef std::size_t    size_type;

       template <class U>
       pymalloc_allocator(const pymalloc_allocator<U>&) throw(){};
       pymalloc_allocator() throw() = default;
       pymalloc_allocator(const pymalloc_allocator&) throw() = default;
       ~pymalloc_allocator() throw() = default;

       // rebind allocator to type U
       template <class U>
       struct rebind {
           typedef pymalloc_allocator<U> other;
       };

       pointer allocate (size_type num, const void* = 0) {
           pointer ret = static_cast<pointer>(PyMem_Malloc(num*sizeof(value_type)));
           return ret;
       }

       void deallocate (pointer p, size_type num) {
           PyMem_Free(p);
       }

       // missing: destroy, construct, max_size, address
    };

    // missing:
    //  bool operator== , bool operator!=

    #include <utility>
    typedef pymalloc_allocator<std::pair<int, int>> PairIntIntAlloc;

    //further helper (not in functional.pxd):
    #include <functional>
    typedef std::less<int> Less;
    """
    cdef cppclass PairIntIntAlloc:
        pass
    cdef cppclass Less:
        pass


from libcpp.map cimport map as cpp_map

def foo():
    cdef:
        cpp_map[int,int, Less, PairIntIntAlloc] m
        int i
    for i in range(50000000):
        m[i] = i

Now, the lion's share of the used memory is returned to the OS once foo is done - on any operating system and with any memory allocator!
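One rough way to check this (psutil is an assumption here and not part of the setup above; on Linux one could also read /proc/self/status):

import os
import psutil

proc = psutil.Process(os.getpid())
foo()
# with the pymalloc-backed map, the resident set size should drop back
# once foo() returns
print(proc.memory_info().rss / 2**20, "MiB after foo()")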


If memory consumption is an issue, one could switch to unordered_map, which needs somewhat less memory. However, at the moment unordered_map.pxd doesn't offer access to all template parameters, so one has to wrap it manually:

%%cython -c=-std=c++11 --cplus

cdef extern from *:
    """
    ....

    //further helper (not in functional.pxd):
    #include <functional>
    ...
    typedef std::hash<int> Hash;
    typedef std::equal_to<int> Equal_to;
    """
    ...
    cdef cppclass Hash:
        pass
    cdef cppclass Equal_to:
        pass

cdef extern from "<unordered_map>" namespace "std" nogil:
    cdef cppclass unordered_map[T, U, HASH=*, PRED=*, ALLOC=*]:
        U& operator[](T&)

N = 5*10**8

def foo_unordered_pymalloc():
    cdef:
        unordered_map[int, int, Hash, Equal_to, PairIntIntAlloc] m
        int i
    for i in range(N):
        m[i] = i

Here are some benchmarks, which are obviously not complete but probably show the direction pretty well (run with N=3e7 instead of N=5e8):

                                   Time           PeakMemory

map_default                        40.1s             1416Mb
map_default+return_memory          41.8s 
map_pymalloc                       12.8s             1200Mb

unordered_default                   9.8s             1190Mb
unordered_default+return_memory    10.9s
unordered_pymalloc                  5.5s              730Mb

The timings were done via the %timeit magic, and peak memory usage via /usr/bin/time -fpeak_used_memory:%M python script_xxx.py.
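Such a script might look as follows (a hypothetical sketch; alloc_variants stands for whatever module the %%cython cells above were compiled into):

# script_map_pymalloc.py - driver for one peak-memory run; run as:
#   /usr/bin/time -fpeak_used_memory:%M python script_map_pymalloc.py
from alloc_variants import foo
foo()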

I'm somewhat surprised that pymalloc outperforms the glibc allocator by so much, and also that memory allocation seems to be the bottleneck for the usual map! Maybe this is the price glibc must pay for supporting multi-threading.

unordered_map is faster and maybe needs less memory (OK, because of the rehashing the last part could be wrong).

Upvotes: 5
