Reputation: 6939
Summary
adataframe is a DataFrame with 800k rows. Naturally, it consumes a bit of memory. When I do this:
adataframe = adataframe.tail(144)
memory is not released.
You could argue that it is released, but that it appears to be used, but that it's marked free and will be reused by Python. However, if I attempt to create a new 800k-row DataFrame and also keep only a small slice, memory usage grows. If I do it again, it grows again, ad infinitum.
I'm using Debian Jessie's Python 3.4.2 with Pandas 0.18.1 and numpy 1.11.1.
Demonstration with minimal program
With the following program I create a dictionary
data = {
0: a_DataFrame_loaded_from_a_CSV,_only_the_last_144_rows,
1: same_thing,
# ...
9: same_thing,
}
and I monitor memory usage while I'm creating the dictionary. Here it is:
#!/usr/bin/env python3
from resource import getrusage, RUSAGE_SELF

import pandas as pd


def print_memory_usage():
    # ru_maxrss is the peak resident set size of this process so far
    print(getrusage(RUSAGE_SELF).ru_maxrss)


def read_dataframe_from_csv(f):
    result = pd.read_csv(f, parse_dates=[0],
                         names=('date', 'value', 'flags'),
                         usecols=('date', 'value', 'flags'),
                         index_col=0, header=None,
                         converters={'flags': lambda x: x})
    # Keep only the last 144 rows
    result = result.tail(144)
    return result


print_memory_usage()
data = {}
for i in range(10):
    with open('data.csv') as f:
        data[i] = read_dataframe_from_csv(f)
    print_memory_usage()
Results
If data.csv only contains a few rows (e.g. 144, in which case the slicing is redundant), memory usage grows very slowly. But if data.csv contains 800k rows, the results are similar to these:
52968
153388
178972
199760
225312
244620
263656
288300
309436
330568
349660
(Adding gc.collect() before print_memory_usage() does not make any significant difference.)
What can I do about it?
Upvotes: 0
Views: 1263
Reputation: 6939
As @Alex noted, slicing a dataframe only gives you a view of the original frame and does not delete it; you need to use .copy() for that. However, even when I used .copy(), memory usage grew and grew and grew, albeit at a slower rate.
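A minimal sketch of the difference between keeping a slice and keeping a copy. The helper name keep_tail and the synthetic single-column frame are made up for illustration, and whether the plain slice actually shares its buffer with the original may depend on the pandas version, so the first check is printed rather than assumed:

```python
import numpy as np
import pandas as pd


def keep_tail(df, n):
    """Return the last n rows as an independent copy (hypothetical helper)."""
    return df.tail(n).copy()


# Synthetic stand-in for the 800k-row frame read from the CSV.
big = pd.DataFrame({"value": np.arange(800000, dtype="float64")})

sliced = big.tail(144)           # may still reference the big frame's buffer
independent = keep_tail(big, 144)  # owns its own 144-row buffer

# Does the small frame still point into the big frame's memory?
print(np.shares_memory(big["value"].to_numpy(), sliced["value"].to_numpy()))
print(np.shares_memory(big["value"].to_numpy(), independent["value"].to_numpy()))
```

As long as a small frame shares memory with the big one, the big buffer cannot be freed, even if the name bound to the big frame is deleted or rebound.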
I suspect that this has to do with how Python, numpy and pandas use memory. A dataframe is not a single object in memory; it contains pointers to other objects (in this particular case, especially to the strings in the "flags" column). When the dataframe is freed, and these objects are freed, the reclaimed free memory can be fragmented. Later, when a huge new dataframe is created, it might not be able to use the fragmented space, and new space might need to be allocated. The details depend on many little things, such as the Python, numpy and pandas versions, and the particulars of each case.
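To illustrate the point that an object column holds pointers to separately allocated Python strings, here is a small sketch. The function name and the synthetic flag values are assumptions for illustration; it says nothing about fragmentation itself, only about how many small objects sit behind such a column:

```python
import pandas as pd


def object_column_sizes(n_rows=1000):
    """Compare pointer-only and deep memory usage of an object column."""
    # dtype=object forces one Python str object per row, as the
    # converters={'flags': lambda x: x} argument does in the question.
    flags = pd.Series(["FLAG%d" % i for i in range(n_rows)], dtype=object)
    shallow = flags.memory_usage(index=False, deep=False)  # pointers only
    deep = flags.memory_usage(index=False, deep=True)      # pointers + strings
    return shallow, deep


shallow, deep = object_column_sizes()
print("pointers only:", shallow, "bytes; including strings:", deep, "bytes")
```

The gap between the two numbers is memory that lives in many small allocations outside the column's own array.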
Rather than investigating these little details, I decided that reading a huge time series and then slicing it is a no go, and that I must read only the part I need right from the start. I like some of the code I created for that, namely the textbisect module and the FilePart class.
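The textbisect module and FilePart class are not reproduced here. As a much simpler sketch of the same idea (read only the part you need), the following hypothetical helper reads the last n lines of a file by seeking backwards from the end, so the bulk of the file is never loaded. It assumes the wanted rows are at the end of the file and that lines are separated by "\n":

```python
import os


def read_last_lines(path, n, block_size=64 * 1024):
    """Return the last n lines of a file as a list of bytes objects.

    Reads fixed-size blocks backwards from the end of the file until
    enough newlines have been seen, so memory use does not depend on
    the total file size.
    """
    if n <= 0:
        return []
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        pos = f.tell()
        data = b""
        # n + 1 newlines guarantee that the first kept line is complete.
        while pos > 0 and data.count(b"\n") <= n:
            step = min(block_size, pos)
            pos -= step
            f.seek(pos)
            data = f.read(step) + data
    return data.splitlines()[-n:]
```

The returned lines can be joined with b"\n", wrapped in io.BytesIO, and passed to pd.read_csv with the same arguments as in the question, so that pandas only ever parses 144 rows.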
Upvotes: 2
Reputation: 19114
You could argue that it is released, but that it appears to be used, but that it's marked free and will be reused by Python.
Correct, that is how maxrss works: it measures peak memory usage, so the reported value can never go down.
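A sketch of how peak and current usage could be compared, assuming Linux (the /proc/self/statm file and the function names below are not from the question). On Linux ru_maxrss is reported in kilobytes; other platforms may use different units:

```python
import resource
from resource import getrusage, RUSAGE_SELF


def peak_rss():
    """Peak resident set size so far (kilobytes on Linux)."""
    return getrusage(RUSAGE_SELF).ru_maxrss


def current_rss_kb():
    """Current resident set size in kilobytes (Linux only, via /proc)."""
    with open("/proc/self/statm") as f:
        resident_pages = int(f.read().split()[1])
    return resident_pages * resource.getpagesize() // 1024


def measure_alloc_and_free(n_bytes=50 * 1024 * 1024):
    """Return (peak, current) readings before, during and after an allocation."""
    before = (peak_rss(), current_rss_kb())
    block = b"x" * n_bytes  # filled, so the pages are actually touched
    during = (peak_rss(), current_rss_kb())
    del block
    after = (peak_rss(), current_rss_kb())
    return before, during, after


for label, (peak, current) in zip(("before", "during", "after"),
                                  measure_alloc_and_free()):
    print(label, "peak:", peak, "current:", current)
```

The peak column can only stay the same or grow, whatever is freed, so a current-usage reading is the more useful one for deciding whether memory was actually given back.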
So the question then is why the garbage collector is not cleaning up the original DataFrames after they have been subsetted.
I suspect it is because subsetting returns a DataFrame that acts as a proxy to the original one (so values don't need to be copied). This would result in a relatively fast subset operation but also memory leaks like the one you found and weird speed characteristics when setting values.
Upvotes: 1