Antonis Christofides

Reputation: 6939

Memory is not released when taking a small slice of a DataFrame

Summary

adataframe is a DataFrame with 800k rows. Naturally, it consumes a fair amount of memory. When I do this:

adataframe = adataframe.tail(144)

memory is not released.

You could argue that the memory is actually released and only appears to be in use, because it has been marked free and will be reused by Python. However, if I attempt to create a new 800k-row DataFrame and again keep only a small slice of it, memory usage grows. If I do it again, it grows again, ad infinitum.

I'm using Debian Jessie's Python 3.4.2 with Pandas 0.18.1 and numpy 1.11.1.

Demonstration with minimal program

With the following program I create a dictionary

data = {
    0:  a_DataFrame_loaded_from_a_CSV,_only_the_last_144_rows,
    1:  same_thing,
    # ...
    9: same_thing,
}

and I monitor memory usage while I'm creating the dictionary. Here it is:

#!/usr/bin/env python3

from resource import getrusage, RUSAGE_SELF

import pandas as pd


def print_memory_usage():
    print(getrusage(RUSAGE_SELF).ru_maxrss)


def read_dataframe_from_csv(f):
    # Parse the whole CSV (keeping 'flags' as raw strings via the
    # identity converter), then keep only the last 144 rows.
    result = pd.read_csv(f, parse_dates=[0],
                         names=('date', 'value', 'flags'),
                         usecols=('date', 'value', 'flags'),
                         index_col=0, header=None,
                         converters={'flags': lambda x: x})
    result = result.tail(144)
    return result


print_memory_usage()
data = {}
for i in range(10):
    with open('data.csv') as f:
        data[i] = read_dataframe_from_csv(f)
    print_memory_usage()

Results

If data.csv only contains a few rows (e.g. 144, in which case the slicing is redundant), memory usage grows very slowly. But if data.csv contains 800k rows, the results are similar to these:

52968
153388
178972
199760
225312
244620
263656
288300
309436
330568
349660

(Adding gc.collect() before print_memory_usage() does not make any significant difference.)

What can I do about it?

Upvotes: 0

Views: 1263

Answers (2)

Antonis Christofides

Reputation: 6939

As @Alex noted, slicing a dataframe only gives you a view into the original frame, but does not release it; you need to use .copy() for that. However, even when I used .copy(), memory usage grew and grew and grew, albeit at a slower rate.
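For reference, here is what I mean, as a minimal sketch (the large frame is simulated here, not the one loaded from the CSV):

import pandas as pd

# A hypothetical large frame standing in for the 800k-row one.
adataframe = pd.DataFrame({'value': range(800000)})

# adataframe = adataframe.tail(144)        # the slice may still reference the full data
adataframe = adataframe.tail(144).copy()   # independent copy; the full frame can be freed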

I suspect that this has to do with how Python, numpy and pandas use memory. A dataframe is not a single object in memory; it contains pointers to other objects (in this particular case, to the strings that make up the "flags" column). When the dataframe and these objects are freed, the reclaimed memory can be fragmented. Later, when a huge new dataframe is created, it might not be able to use the fragmented space, and new space might need to be allocated. The details depend on many little things, such as the Python, numpy and pandas versions, and the particulars of each case.
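One way to see that those strings live outside the frame's own arrays is to compare pandas' shallow and deep memory reports (a rough illustration with a made-up frame, not the data from the question):

import pandas as pd

df = pd.DataFrame({'value': [1.0] * 1000,
                   'flags': ['A'] * 1000})

# Shallow: counts only the arrays the frame holds directly
# (for an object column, that is essentially the pointers).
print(df.memory_usage())

# Deep: also follows the pointers and counts the Python string
# objects themselves, which are separate heap allocations.
print(df.memory_usage(deep=True))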

Rather than investigating these little details, I decided that reading a huge time series and then slicing it is a no-go, and that I must read only the part I need right from the start. I like some of the code I created for that, namely the textbisect module and the FilePart class.
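That code is not reproduced here, but as a simpler sketch of the same idea (only an illustration, not my actual solution), the last N lines of the file can be collected with a bounded collections.deque before pandas ever sees them, so the full 800k-row frame is never built:

import io
from collections import deque

import pandas as pd


def read_last_rows(filename, nrows=144):
    # Stream through the file, keeping only the last `nrows` lines;
    # the deque silently discards older lines as new ones arrive.
    with open(filename) as f:
        last_lines = deque(f, maxlen=nrows)
    return pd.read_csv(io.StringIO(''.join(last_lines)),
                       parse_dates=[0],
                       names=('date', 'value', 'flags'),
                       usecols=('date', 'value', 'flags'),
                       index_col=0, header=None,
                       converters={'flags': lambda x: x})


data = {i: read_last_rows('data.csv') for i in range(10)}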

Upvotes: 2

Alex

Reputation: 19114

"You could argue that the memory is actually released and only appears to be in use, because it has been marked free and will be reused by Python."

Correct, that is how maxrss works (it measures peak memory usage). See here.
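In other words, ru_maxrss only ever increases, even if memory is later returned to the allocator or the OS. To watch the current resident size instead, something like the following could be used (this assumes the third-party psutil package is installed; it is not part of the original program):

import psutil


def print_current_memory_usage():
    # Resident set size right now (converted to KiB for comparison
    # with ru_maxrss on Linux), rather than the historical peak.
    print(psutil.Process().memory_info().rss // 1024)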

So the question, then, is why the garbage collector is not cleaning up the original DataFrames after they have been subsetted.

I suspect it is because subsetting returns a DataFrame that acts as a proxy to the original one (so values don't need to be copied). This would result in a relatively fast subset operation but also memory leaks like the one you found and weird speed characteristics when setting values.
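Whether the values are actually shared can be probed directly, at least for a numeric column (a small sketch, not taken from the question's program; the result may depend on the pandas version):

import numpy as np
import pandas as pd

df = pd.DataFrame({'value': np.arange(800000)})
tail = df.tail(144)

# True would mean the tail is a view onto df's data, so df's
# memory cannot be reclaimed while tail is alive.
print(np.shares_memory(df['value'].values, tail['value'].values))

# After an explicit copy the data is independent.
print(np.shares_memory(df['value'].values, tail['value'].copy().values))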

Upvotes: 1
