James

Reputation: 25543

Pandas & Scikit: memory usage when slicing DataFrame

I have a largeish DataFrame, loaded from a CSV file (about 300 MB).

From this, I'm extracting a few dozen features to use in a RandomForestClassifier: some of the features are simply derived from columns in the data, for example:

feature1 = data["SomeColumn"].apply(len)  # derived feature: length of each value
feature2 = data["AnotherColumn"]          # column used as-is

And others are created as new DataFrames from numpy arrays, using the index of the original DataFrame:

feature3 = pandas.DataFrame(count_array, index=data.index)

All these features are then joined into one DataFrame:

features = pandas.DataFrame(feature1).join(feature2) # etc...

And I train a random forest classifier:

classifier = RandomForestClassifier(
    n_estimators=100,
    max_features=None,
    verbose=2,
    compute_importances=True,
    n_jobs=n_jobs,
    random_state=0,
)
classifier.fit(features, data["TargetColumn"])
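
As an aside, it can be worth checking what fit() will actually receive here: pandas builds a single 2-D ndarray from the joined frame, and if the columns have mixed dtypes that array is already an upcast copy, which scikit-learn may then convert again. A minimal check, assuming features is the joined DataFrame above:

import numpy

X = features.values                   # the one 2-D array spanning all feature columns
print(X.dtype)                        # 'object' or an upcast dtype means pandas already copied
print(X.flags["C_CONTIGUOUS"])        # False suggests another copy on conversion
print("%.1f MB" % (X.nbytes / 1e6))  # size of this intermediate array alone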

The RandomForestClassifier works fine with these features; building a tree takes O(hundreds of megabytes) of memory. However, if after loading my data I take a small subset of it:

data_slice = data[data['somecolumn'] > value]

Then building a tree for my random forest suddenly takes many gigabytes of memory, even though the features DataFrame is now O(10%) of the size of the original.

I can believe that this might be because a sliced view on the data doesn't permit further slices to be taken efficiently (though I don't see how this could propagate into the features array), so I've tried:

data = pandas.DataFrame(data_slice, copy=True)

but this doesn't help.
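
One mitigation I can sketch (assuming the extra memory comes from conversions inside fit; float32 as scikit-learn's internal tree dtype is an assumption, not something I've verified for this version) is to build the model matrix explicitly before calling fit, so the one up-front conversion is visible and the intermediate frames can be released:

import numpy

# Do the one unavoidable conversion explicitly, up front.
X = numpy.ascontiguousarray(features.values, dtype=numpy.float32)
y = data_slice["TargetColumn"].values

del features  # allow the intermediate frames to be freed before training
classifier.fit(X, y)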

Upvotes: 2

Views: 2202

Answers (1)

ogrisel

Reputation: 40169

The RandomForestClassifier copies the dataset several times in memory, especially when n_jobs is large. We are aware of these issues and fixing them is a priority:

  • I am currently working on a subclass of the standard library's multiprocessing.Pool class that will avoid memory copies when numpy.memmap instances are passed to the subprocess workers. This will make it possible to share the memory of the source dataset, plus some precomputed data structures, between the workers; a rough manual sketch of the idea is given below. Once this is fixed I will close this issue on the github tracker.

  • There is an ongoing refactoring that will further decrease the memory usage of RandomForestClassifier by a factor of two. However, the current state of the refactoring is twice as slow as master, so further work is still required.

However, none of those fixes will make it into the 0.12 release, which is scheduled for next week. Most probably they will be done for 0.13 (planned for release in 3 to 4 months), but of course they will be available in the master branch a lot sooner.
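
The Pool subclass from the first bullet is not shown here, but the underlying idea can be sketched by hand (the file name, shapes and toy worker below are hypothetical): write the training data to a numpy.memmap once, and have each worker re-open it by path, so the array is shared through the OS page cache instead of being pickled into every subprocess:

import multiprocessing

import numpy

def work(args):
    # Hypothetical worker: re-open the shared memmap read-only and
    # operate on a range of rows; the full dataset is never copied.
    path, dtype, shape, start, stop = args
    X = numpy.memmap(path, dtype=dtype, mode="r", shape=shape)
    return float(X[start:stop].sum())  # stand-in for real training work

if __name__ == "__main__":
    X = numpy.random.rand(1000, 50).astype(numpy.float32)

    # Dump the features to disk once; every worker maps the same file.
    mm = numpy.memmap("features.mmap", dtype=X.dtype, mode="w+", shape=X.shape)
    mm[:] = X
    mm.flush()

    chunks = [("features.mmap", str(X.dtype), X.shape, i * 250, (i + 1) * 250)
              for i in range(4)]
    pool = multiprocessing.Pool(processes=4)
    print(pool.map(work, chunks))
    pool.close()
    pool.join()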

Upvotes: 4
