Reputation: 2730
Below is some code so you can exactly reproduce the problem. Essentially, this will explode your memory from 90MB or so to in excess of 5GB in a matter of seconds if you don't kill it. Along with the memory consumption, the CPU will also be maxed out.
The memory will also be held on to after the sorting function exits.
I only seem to surface this issue if I start with a big master dataframe, slice it up, and then do the sorting. If I build a bunch of independent dataframes, this doesn't happen.
def test_sorting(df_list):
    counter = 0
    total = len(df_list)
    for i in range(0, total):
        df_list[i].sort_index(inplace=True)
import pandas as pd
import numpy as np
from math import floor

def make_master_df(rows=250000):
    groups = 5
    # 26 integer columns, indexed by a two-level (timestep, id) MultiIndex
    df = pd.DataFrame(np.random.randint(0, 100, size=(rows, 26)), columns=list('ABCDEFGHIJKLMNOPQRSTUVWXYZ'))
    df["timestep"] = pd.Series([floor(x / groups) for x in range(0, rows)])
    df["id"] = pd.Series([x % groups for x in range(0, rows)])
    df = df.set_index(["timestep", "id"]).sort_index()
    return df
def create_train_test_windows(df, train_size, test_size, slide_size, include_history=True, second_index=False):
    n = train_size + test_size
    size_multiplier = 1
    if second_index:
        size_multiplier = df.index.levels[1].size
        n = n * size_multiplier
    list_df = None
    if include_history:
        df.sort_index(ascending=True, inplace=True)
        # each window is a slice of the master dataframe, trimmed further from the end at each step
        list_df = [df[:-(i + n)] for i in range(0, df.shape[0], slide_size * size_multiplier)]
        list_df.insert(0, df[:])
        list_df = list_df[::-1]
    else:
        raise Exception("excluding history currently not supported.")
    list_df = [x for x in list_df if x.shape[0] >= n]
    return list_df
master_df = make_master_df()
list_df = create_train_test_windows(master_df, 500, 20, 20, include_history=True, second_index=True)
Finally, this call will blow up your memory during execution, and that memory will be held onto after execution is over.
test_sorting(list_df)
NOTES:
I have noticed that each of the sliced dataframes maintained the full index level size for the first index (timesteps); a quick check of this is sketched below, after these notes.
I have forced a gc.collect() call on every step just to be aggressive about it (it didn't help at all).
I have tested as a standalone python script and in an IPython notebook with the same results.
My best guess is that the sliced dataframes are not in fact proper slices; they are bringing a fair amount of baggage with them that is referenced somewhere else.
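To illustrate that first note, here is a minimal check (it assumes the master_df and list_df built by the code above):

# Minimal check of the retained index levels (assumes master_df/list_df from above).
sliced = list_df[0]
print(sliced.shape[0])                                      # rows actually in this slice
print(sliced.index.get_level_values("timestep").nunique())  # timesteps actually present
print(sliced.index.levels[0].size)                          # still reports the full level size from master_df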
Any Insights/Assistance is greatly appreciated!
Upvotes: 1
Views: 588
Reputation: 2730
I solved this.
In my posted code above, I am using the following to create my dataframe slices:
list_df = [df[:-(i + n)] for i in range(0, df.shape[0], slide_size * size_multiplier)]
This returns a reference to the original dataframe, not a "true" copy, and that reference is held on to. Therefore, when I sort, all of the required indices are created with references back to the original dataframe, which is why the memory consumption explodes.
To solve this, I am now using the following to slice my dataframe up:
list_df = [df[:-(i + n)].copy() for i in range(0, df.shape[0], slide_size * size_multiplier)]
.copy() returns a full copy with no references to the original dataframe.
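As a rough sanity check (a sketch only; the exact view/copy behavior can vary across pandas versions), you can test whether a plain slice's column data still shares memory with the master frame, and confirm that the copied slice does not:

# Rough diagnostic: does a plain slice share its underlying column data with
# master_df, and does .copy() break that link? (behavior may vary by pandas version)
import numpy as np

plain_slice = master_df[:-1000]          # arbitrary slice, no copy
copied_slice = master_df[:-1000].copy()  # independent copy

print(np.shares_memory(plain_slice["A"].values, master_df["A"].values))   # typically True
print(np.shares_memory(copied_slice["A"].values, master_df["A"].values))  # False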
Caveats
With the .copy() option, I get to about 30GB of memory consumption, with spikes up to roughly 30.3GB during the sorts. My execution time creating the slices is fractionally slower, but my sort speeds are significantly faster.
Without the .copy() option, I start at about 95MB and end at about 32GB. My slice creation is marginally faster, while my sorting is far slower. It also introduces a potential caveat: because my slices overlap, depending on how I want to sort each slice I may be undoing work I previously did.
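For reference, here is a minimal sketch of how the process memory can be watched from inside the script while these runs execute (it assumes psutil is installed; it is not necessarily how the figures above were measured):

# Minimal sketch for watching resident memory around the sort (assumes psutil is installed).
import os
import psutil

def print_rss(label):
    rss_gb = psutil.Process(os.getpid()).memory_info().rss / 1024 ** 3
    print("{}: {:.2f} GB resident".format(label, rss_gb))

print_rss("before sorting")
test_sorting(list_df)
print_rss("after sorting")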
Summary: if you intend to do any fancy work with slices of a larger dataframe, it appears to be much better, from both a memory and CPU perspective, to copy those slices using .copy() on the slice.
example:
df[1:9].copy()
Upvotes: 1