Reputation: 185
I'm using a scikit-learn Random Forest to fit training data (~30 MB) and my laptop keeps crashing from running out of application memory. The test data is a few times bigger than the training data. I'm using a MacBook Air, 2 GHz, 8 GB memory.
What are some of the ways to deal with this?
rf = RandomForestClassifier(n_estimators = 100, n_jobs=4)
print "20 Fold CV Score: ", np.mean(cross_validation.cross_val_score(rf, X_train_a, y_train, cv=20, scoring='roc_auc'))
Upvotes: 3
Views: 5777
Reputation: 5839
I was dealing with a ~4 MB dataset, and a Random Forest from scikit-learn with default hyper-parameters was ~50 MB (so more than 10 times the size of the data). By setting max_depth=6 the memory consumption decreased 66 times, and the performance of the shallow Random Forest on my dataset actually improved!
I wrote this experiment up in a blog post.
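For illustration, here is a minimal sketch of how one might compare the pickled size of a default forest against a depth-limited one. The synthetic dataset (and therefore the exact numbers) is only an assumption for the example:

import pickle
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data purely for illustration; absolute sizes depend on your dataset.
X, y = make_classification(n_samples=10000, n_features=20, random_state=0)

for depth in (None, 6):
    rf = RandomForestClassifier(n_estimators=100, max_depth=depth, random_state=0)
    rf.fit(X, y)
    size_mb = len(pickle.dumps(rf)) / 1e6
    print("max_depth=%s -> pickled model size: %.1f MB" % (depth, size_mb))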
From my experience, in the case of regression tasks the memory usage can grow even more, so it is important to control the tree depth. The tree depth can be controlled directly with max_depth, or by tuning min_samples_split, min_samples_leaf, min_weight_fraction_leaf, max_features, and max_leaf_nodes.
The memory footprint of the Random Forest can of course also be controlled with the number of trees in the ensemble.
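As a sketch of what that tuning might look like for a regression task (the parameter values here are only assumptions, not recommendations):

from sklearn.ensemble import RandomForestRegressor

# Illustrative values only; tune them for your own data.
rf = RandomForestRegressor(
    n_estimators=50,        # fewer trees -> a proportionally smaller ensemble
    min_samples_leaf=5,     # larger leaves -> shallower, smaller trees
    max_leaf_nodes=256,     # hard cap on the size of each tree
    max_features="sqrt",    # fewer candidate features per split
)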
Upvotes: 1
Reputation: 5390
Your best choice is to tune the arguments.
n_jobs=4
This makes the computer compute four train-test cycles simultaneously. Each Python job runs in a separate process, so the full dataset is also copied for each of them. Try reducing n_jobs to 2 or 1 to save memory; n_jobs=4 uses roughly four times the memory that n_jobs=1 uses.
cv=20
This splits the data into 20 pieces and the code does 20 train-test iterations, which means that each training set is the size of 19 of those pieces (95% of the original data). You can quite safely reduce it to 10, although your accuracy estimate might get slightly worse. It won't save much memory, but it makes the runtime shorter.
n_estimators = 100
Reducing this will save little memory, but it will make the algorithm run faster as the random forest will contain fewer trees.
To sum up, I'd recommend reducing n_jobs to 2 to save memory (a 2-fold increase in runtime). To compensate for the runtime, I'd suggest changing cv to 10 (a 2-fold saving in runtime). If that does not help, change n_jobs to 1 and also reduce the number of estimators to 50 (another 2-fold speed-up).
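Putting those suggestions together, a minimal sketch of the adjusted call (X_train_a and y_train are the variables from the question; in newer scikit-learn versions cross_val_score lives in sklearn.model_selection rather than sklearn.cross_validation):

import numpy as np
from sklearn import cross_validation
from sklearn.ensemble import RandomForestClassifier

# Fewer parallel jobs and fewer folds keep peak memory down;
# fewer trees mainly speeds things up.
rf = RandomForestClassifier(n_estimators=50, n_jobs=2)
scores = cross_validation.cross_val_score(rf, X_train_a, y_train, cv=10, scoring='roc_auc')
print("10 Fold CV Score: %s" % np.mean(scores))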
Upvotes: 7