Reputation: 5248
I'm having a memory issue. I have a pickle file that I wrote with the Python 2.7 cPickle module. This file is 2.2GB on disk. It contains a dictionary of various nestings of dictionaries, lists, and numpy arrays.
When I load this file (again using cPickle on Python 2.7), the Python process ends up using 5.13GB of memory. Then, if I delete the reference to the loaded data, memory usage drops by 2.79GB. At the end of the program there is still another 2.38GB that has not been cleaned up.
Is there some cache or memoization table that cPickle keeps in the backend? Where is this extra data coming from? Is there a way to clear it?
There are no custom objects in the loaded pickle, just dicts, lists, and numpy arrays. I can't wrap my head around why it's behaving this way.
Here is a simple script I wrote to demonstrate the behavior:
from six.moves import cPickle as pickle
import time
import gc
import utool as ut
print('Create a memory tracker object to snapshot memory usage in the program')
memtrack = ut.MemoryTracker()
print('Print out how large the file is on disk')
fpath = 'tmp.pkl'
print(ut.get_file_nBytes_str('tmp.pkl'))
print('Report memory usage before loading the data')
memtrack.report()
print(' Load the data')
with open(fpath, 'rb') as file_:
    data = pickle.load(file_)
print(' Check how much data it used')
memtrack.report()
print(' Delete the reference and check again')
del data
memtrack.report()
print('Check to make sure the system doesnt want to clean itself up')
print(' This never does anything. I dont know why I bother')
time.sleep(1)
gc.collect()
memtrack.report()
time.sleep(10)
gc.collect()
for i in range(10000):
    time.sleep(.001)
print(' Check one more time')
memtrack.report()
And here is its output
Create a memory tracker object to snapshot memory usage in the program
[memtrack] +----
[memtrack] | new MemoryTracker(Memtrack Init)
[memtrack] | Available Memory = 12.41 GB
[memtrack] | Used Memory = 39.09 MB
[memtrack] L----
Print out how large the file is on disk
2.00 GB
Report memory usage before loading the data
[memtrack] +----
[memtrack] | diff(avail) = 0.00 KB
[memtrack] | [] diff(used) = 12.00 KB
[memtrack] | Available Memory = 12.41 GB
[memtrack] | Used Memory = 39.11 MB
[memtrack] L----
Load the data
Check how much data it used
[memtrack] +----
[memtrack] | diff(avail) = 5.09 GB
[memtrack] | [] diff(used) = 5.13 GB
[memtrack] | Available Memory = 7.33 GB
[memtrack] | Used Memory = 5.17 GB
[memtrack] L----
Delete the reference and check again
[memtrack] +----
[memtrack] | diff(avail) = -2.80 GB
[memtrack] | [] diff(used) = -2.79 GB
[memtrack] | Available Memory = 10.12 GB
[memtrack] | Used Memory = 2.38 GB
[memtrack] L----
Check to make sure the system doesnt want to clean itself up
This never does anything. I dont know why I bother
[memtrack] +----
[memtrack] | diff(avail) = 40.00 KB
[memtrack] | [] diff(used) = 0.00 KB
[memtrack] | Available Memory = 10.12 GB
[memtrack] | Used Memory = 2.38 GB
[memtrack] L----
Check one more time
[memtrack] +----
[memtrack] | diff(avail) = -672.00 KB
[memtrack] | [] diff(used) = 0.00 KB
[memtrack] | Available Memory = 10.12 GB
[memtrack] | Used Memory = 2.38 GB
[memtrack] L----
As a sanity check, here is a script that allocates the same amount of data and then deletes it; the process cleans itself up perfectly.
Here is the script:
import numpy as np
import utool as ut
memtrack = ut.MemoryTracker()
data = np.empty(2200 * 2 ** 20, dtype=np.uint8) + 1
print(ut.byte_str2(data.nbytes))
memtrack.report()
del data
memtrack.report()
And here is the output
[memtrack] +----
[memtrack] | new MemoryTracker(Memtrack Init)
[memtrack] | Available Memory = 12.34 GB
[memtrack] | Used Memory = 39.08 MB
[memtrack] L----
2.15 GB
[memtrack] +----
[memtrack] | diff(avail) = 2.15 GB
[memtrack] | [] diff(used) = 2.15 GB
[memtrack] | Available Memory = 10.19 GB
[memtrack] | Used Memory = 2.19 GB
[memtrack] L----
[memtrack] +----
[memtrack] | diff(avail) = -2.15 GB
[memtrack] | [] diff(used) = -2.15 GB
[memtrack] | Available Memory = 12.34 GB
[memtrack] | Used Memory = 39.10 MB
[memtrack] L----
Just to do a sanity check that there are no custom types in the loaded data, these are the types that occur in this structure. data itself is a dict with the following keys: ['maws_lists', 'int_rvec', 'wx_lists', 'aid_to_idx', 'agg_flags', 'agg_rvecs', 'gamma_list', 'wx_to_idf', 'aids', 'fxs_lists', 'wx_to_aids']. The following script is specific to the particular nesting of this structure, but it exhaustively shows the types used in this container:
print(data.keys())
type_set = set()
type_set.add(type(data['int_rvec']))
type_set.add(type(data['wx_to_aids']))
type_set.add(type(data['wx_to_idf']))
type_set.add(type(data['gamma_list']))
type_set.update(set([n2.dtype for n1 in data['agg_flags'] for n2 in n1]))
type_set.update(set([n2.dtype for n1 in data['agg_rvecs'] for n2 in n1]))
type_set.update(set([n2.dtype for n1 in data['fxs_lists'] for n2 in n1]))
type_set.update(set([n2.dtype for n1 in data['maws_lists'] for n2 in n1]))
type_set.update(set([n1.dtype for n1 in data['wx_lists']]))
type_set.update(set([type(n1) for n1 in data['aids']]))
type_set.update(set([type(n1) for n1 in data['aid_to_idx'].keys()]))
type_set.update(set([type(n1) for n1 in data['aid_to_idx'].values()]))
The output of type_set is
{bool,
dtype('bool'),
dtype('uint16'),
dtype('int8'),
dtype('int32'),
dtype('float32'),
NoneType,
int}
which shows that all sequences end up resolving to None, a standard Python type, or a standard numpy type. You'll have to trust me that the iterable types are all lists and dicts.
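For reference, a generic recursive walk along these lines (a rough sketch, not one of the scripts above) would collect every container type and every leaf type/dtype it encounters, which checks the same claim without hand-written per-key code:
import numpy as np

def collect_types(obj, seen=None):
    # Recursively gather container types and leaf types/dtypes from a
    # nested structure of dicts, lists, and numpy arrays.
    if seen is None:
        seen = set()
    if isinstance(obj, dict):
        seen.add(dict)
        for key, val in obj.items():
            collect_types(key, seen)
            collect_types(val, seen)
    elif isinstance(obj, list):
        seen.add(list)
        for item in obj:
            collect_types(item, seen)
    elif isinstance(obj, np.ndarray):
        seen.add(obj.dtype)
    else:
        seen.add(type(obj))
    return seen

# print(collect_types(data))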
In short, my question is: where is this extra memory coming from, and is there a way to clear it?
Upvotes: 4
Views: 3372
Reputation: 184465
One possible culprit here is that Python, by design, overallocates data structures like lists and dictionaries to make appending to them faster, because memory allocations are slow. For example, on a 32-bit Python, an empty list has a sys.getsizeof() of 36 bytes. Append one element and it becomes 52 bytes. It remains 52 bytes until it has five elements, at which point it becomes 68 bytes. So, clearly, when you appended the first element, Python allocated enough memory for four, and then it allocated enough memory for four more when you added the fifth element (LEELOO DALLAS). As the list grows, the amount of padding added grows faster and faster: essentially, the memory allocation of the list doubles each time you fill it up.
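You can watch this happen with sys.getsizeof(); the exact byte counts depend on the Python version and pointer size, but the stepwise jumps are the point. A rough sketch:
import sys

# Watch a list over-allocate as it grows. Exact byte counts vary with
# the Python build; the size only changes when a new block is allocated.
lst = []
last = sys.getsizeof(lst)
print('empty list: %d bytes' % last)
for i in range(10):
    lst.append(i)
    size = sys.getsizeof(lst)
    if size != last:
        print('%d elements: %d bytes' % (len(lst), size))
        last = size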
So I expect something like that is going on here. The pickle protocol does not appear to store the length of pickled containers, at least for the built-in Python types, so the unpickler essentially reads one list or dictionary item at a time and appends it, and Python grows the object as items are added, just as described above. Depending on how the sizes of the objects shake out when you unpickle your data, you might have a lot of extra space left over in your lists and dictionaries. (Not sure how numpy objects are stored, however; they might be more compact.)
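One way to see the append-at-a-time behaviour is to disassemble a small pickle with the standard pickletools module; for plain lists the items arrive in MARK ... APPENDS batches, with no up-front length for the unpickler to preallocate from:
import pickle
import pickletools

# Disassemble a small pickle: there is no explicit list length, just
# batches of items that are appended to the list as they are read.
pickletools.dis(pickle.dumps(list(range(5)), protocol=2))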
Potentially some temporary objects are also being allocated during unpickling, which would help explain how the memory usage got that large.
Now, when you make a copy of a list or dictionary, Python knows exactly how many items it has and can allocate exactly the right amount of memory for the copy. If a hypothetical 5-element list x is allocated 68 bytes because it is expected to grow to 8 elements, the copy x[:] is allocated 56 bytes because that's exactly the right amount. So you could give that a shot with one of your more sizable objects after loading, and see if it helps noticeably.
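If you want to try that across the whole structure, something along these lines would do it (just a sketch of the idea, with names I made up): it rebuilds lists and dicts from their contents, bottom-up, and leaves numpy arrays alone since those already know their exact size.
def compact(obj):
    # Rebuild lists and dicts so they are sized for what they actually
    # hold; leave numpy arrays and scalars untouched.
    if isinstance(obj, dict):
        return {k: compact(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [compact(v) for v in obj]
    return obj

# data = compact(data)
# memtrack.report()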
But it might not. Python doesn't necessarily release memory back to the OS when objects are destroyed. Instead, it may hold on to the memory in case it needs to allocate more objects of the same kind (which is pretty likely), because reusing memory you already have is less costly than releasing that memory only to re-allocate it later. So although Python might not have given the memory back to the OS, that doesn't mean there's a leak. It's available for use by the rest of your script; the OS just can't see it. There isn't a way to force Python to give it back in this case.
I don't know what utool is (I found a Python package by that name but it doesn't seem to have a MemoryTracker class), but depending on what it's measuring, it might be showing the OS's take on it, not Python's. In this case, what you're seeing is essentially your script's peak memory use, since Python is holding onto that memory in case you need it for something else. If you never use it, it will eventually be swapped out by the OS and the physical RAM will be given to some other process that needs it.
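If you want to check what kind of number you're looking at, the OS-level figure (resident set size) is easy to read directly; my guess is that is roughly what MemoryTracker is reporting. For example, with psutil (assuming it's installed):
import os
import psutil

# Resident set size as the OS sees it, in bytes. This includes memory the
# Python allocator is holding onto internally, not just live objects.
rss = psutil.Process(os.getpid()).memory_info().rss
print('RSS: %.2f MB' % (rss / 2.0 ** 20))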
Bottom line, the amount of memory your script is using is not a problem to be solved in itself, and in general is not something you will need to concern yourself with. (That's why you're using Python in the first place!) Does your script work, and does it run quickly enough? Then you're fine. Python and NumPy are both mature and widely used software; the likelihood of finding a true, previously undetected memory leak of this size in something as frequently used as the pickle library is pretty slim.
If available, it would be interesting to compare your script's memory usage with the amount of memory used by the script that writes the data.
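One way to get that comparison is to log peak memory in both the writing and the reading script with the standard resource module (a sketch; note that ru_maxrss is reported in kilobytes on Linux but in bytes on macOS):
import resource

# Peak resident set size of the current process so far.
peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print('peak RSS: %.2f GB (assuming kilobytes)' % (peak / 2.0 ** 20))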
Upvotes: 8