Reputation: 25543
I have a largeish DataFrame, loaded from a csv file (about 300MB).
From this, I'm extracting a few dozen features to use in a RandomForestClassifier: some of the features are simply derived from columns in the data, for example:
feature1 = data["SomeColumn"].apply(len)
feature2 = data["AnotherColumn"]
And others are created as new DataFrames from numpy arrays, using the index of the original dataframe:
feature3 = pandas.DataFrame(count_array, index=data.index)
All these features are then joined into one DataFrame:
features = feature1.join(feature2) # etc...
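(For illustration, the chained joins above can equivalently be written with `pandas.concat`, which aligns everything on the shared index in one call; the feature names here are placeholders:)

```python
import pandas as pd

# Toy stand-ins for the real features; each shares the same index.
feature1 = pd.Series([3, 5, 4], name="feature1")
feature2 = pd.Series([1.0, 2.0, 3.0], name="feature2")
feature3 = pd.DataFrame({"feature3": [7, 8, 9]})

# axis=1 concatenation aligns on the index, like repeated .join() calls
features = pd.concat([feature1, feature2, feature3], axis=1)
```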
And I train a random forest classifier:
classifier = RandomForestClassifier(
    n_estimators=100,
    max_features=None,
    verbose=2,
    compute_importances=True,
    n_jobs=n_jobs,
    random_state=0,
)
classifier.fit(features, data["TargetColumn"])
The RandomForestClassifier works fine with these features: building a tree takes O(hundreds of megabytes) of memory. However, if after loading my data I take a small subset of it:
data_slice = data[data['somecolumn'] > value]
Then building a tree for my random forest suddenly takes many gigabytes of memory, even though the size of the features DataFrame is now O(10%) of the original.
I can believe that this might be because a sliced view on the data doesn't permit further slices to be done efficiently (though I don't see how this could propagate into the features array), so I've tried:
data = pandas.DataFrame(data_slice, copy=True)
but this doesn't help.
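(For context, here is the kind of defragmenting copy I mean, sketched on a toy frame with placeholder column names: resetting the index and extracting a fresh contiguous array, on the assumption that the slice's memory layout is the culprit.)

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data; "somecolumn" and "x" are placeholders.
data = pd.DataFrame({"somecolumn": np.arange(10), "x": np.arange(10) * 0.5})
data_slice = data[data["somecolumn"] > 5]

# Drop the sliced index and force a fresh, C-contiguous float array,
# so nothing downstream inherits the view's fragmented layout.
data_slice = data_slice.reset_index(drop=True)
X = np.ascontiguousarray(data_slice[["x"]].values, dtype=np.float64)
```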
Is there some way of rebuilding the DataFrame which might make things more efficient again?
Upvotes: 2
Views: 2202
Reputation: 40169
The RandomForestClassifier is copying the dataset several times in memory, especially when n_jobs is large. We are aware of these issues and it's a priority to fix them:
I am currently working on a subclass of the multiprocessing.Pool class of the standard library that will do no memory copy when numpy.memmap instances are passed to the subprocess workers. This will make it possible to share the memory of the source dataset plus some precomputed data structures between the workers. Once this is fixed I will close this issue on the github tracker.
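(The idea can be sketched roughly as follows; this is a hypothetical illustration of memmap-backed sharing, not the actual Pool subclass:)

```python
import os
import tempfile

import numpy as np

# Parent process: dump the feature matrix into a file-backed memmap.
X = np.arange(20.0).reshape(4, 5)
path = os.path.join(tempfile.mkdtemp(), "features.mmap")

mm = np.memmap(path, dtype=np.float64, mode="w+", shape=X.shape)
mm[:] = X  # write once
mm.flush()

# A worker process would reopen the same file read-only: the OS maps
# the same pages into each worker instead of copying the array.
X_shared = np.memmap(path, dtype=np.float64, mode="r", shape=X.shape)
```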
There is an ongoing refactoring that will further decrease the memory usage of RandomForestClassifier by a factor of two. However, the current state of the refactoring is twice as slow as master, hence further work is still required.
However, none of those fixes will make it into the 0.12 release that is scheduled for next week. Most probably they will be done for 0.13 (planned for release in 3 to 4 months), but of course they will be available in the master branch a lot sooner.
Upvotes: 4