Reputation: 794
I am trying to run the scikit-learn random forest algorithm on the MNIST handwritten-digits dataset. During training, the process fails with a MemoryError. What should I do to fix this issue?
Hardware: Intel Core 2 Duo with 4 GB RAM
The shape of the dataset is (60000, 784). The complete error, as shown on the Linux terminal, is as follows:
> File "./reducer.py", line 53, in <module>
>     main()
> File "./reducer.py", line 38, in main
>     clf = clf.fit(data, labels)  # training the algorithm
> File "/usr/lib/pymodules/python2.7/sklearn/ensemble/forest.py", line 202, in fit
>     for i in xrange(n_jobs))
> File "/usr/lib/pymodules/python2.7/joblib/parallel.py", line 409, in __call__
>     self.dispatch(function, args, kwargs)
> File "/usr/lib/pymodules/python2.7/joblib/parallel.py", line 295, in dispatch
>     job = ImmediateApply(func, args, kwargs)
> File "/usr/lib/pymodules/python2.7/joblib/parallel.py", line 101, in __init__
>     self.results = func(*args, **kwargs)
> File "/usr/lib/pymodules/python2.7/sklearn/ensemble/forest.py", line 73, in _parallel_build_trees
>     sample_mask=sample_mask, X_argsorted=X_argsorted)
> File "/usr/lib/pymodules/python2.7/sklearn/tree/tree.py", line 476, in fit
>     X_argsorted=X_argsorted)
> File "/usr/lib/pymodules/python2.7/sklearn/tree/tree.py", line 357, in _build_tree
>     np.argsort(X.T, axis=1).astype(np.int32).T)
> File "/usr/lib/python2.7/dist-packages/numpy/core/fromnumeric.py", line 680, in argsort
>     return argsort(axis, kind, order)
> MemoryError
Upvotes: 5
Views: 10331
Reputation: 5849
Train the Random Forest with a single tree and check the tree depth. By default, scikit-learn grows full trees, and their depth can be very large, so even a few trees in the forest can use a lot of memory. You can limit tree depth with the max_depth hyper-parameter.
I ran an experiment in which I reduced the depth of the trees from 42 (the average tree depth in the forest) to 6. Memory use decreased 66-fold, while performance was slightly better (by about 4%).
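A minimal sketch of the idea, using the small built-in digits dataset as a stand-in for MNIST (the n_estimators and max_depth values here are illustrative, not the ones from the experiment above):

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier

X, y = load_digits(return_X_y=True)  # small stand-in for the 60000x784 MNIST array

# Default settings grow each tree to full depth.
deep = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
print(max(t.get_depth() for t in deep.estimators_))

# Capping max_depth keeps every tree (and its memory footprint) small.
shallow = RandomForestClassifier(n_estimators=10, max_depth=6, random_state=0).fit(X, y)
print(max(t.get_depth() for t in shallow.estimators_))
```

Printing the depths before and after makes it easy to verify how much the cap actually shrinks the trees on your own data.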
Upvotes: 0
Reputation: 3289
One solution is to use the most recent version (0.19) of scikit-learn. In the change log, the bug-fixes section mentions a major improvement:
Fixed excessive memory usage in prediction for random forests estimators. #8672 by Mike Benfield.
You can install this version by using:
pip3 install scikit-learn==0.19.0
Upvotes: 1
Reputation: 1
With all due respect to other opinions, scikit-learn 0.16.1 does not prove to have, in its .ensemble methods, the "nasty" X, y replicas cited for some early versions.
For unrelated reasons, I have spent rather a long time on the RandomForestRegressor() hyper-parameter landscape, including its memory-footprint problems. As of 0.16.1, there was less than a 2% increase in parallel-joblib memory requirements when moving from the default n_jobs = 1 to { 2, 3, ... }.
A co-author of recent scikit-learn releases, @glouppe, posted a marvelous and insightful presentation (2014-Aug, rel. 0.15.0), including comparisons with R-based and other well-known RandomForest frameworks.
IMHO, pages 25+ describe techniques that increase speed, including np.asfortranarray(...); however, these seem to me (without any experimental proof) to be internal directions shared within the scikit-learn development team rather than a recommendation for us mortals living in the "outer world".
Yes, that matters. Some additional feature-engineering effort and testing may be in order if you are not doing full-scale FeatureSET vector bagging. Your learner seems to be the classifier case, so look deeper into max_features et al., and use mkswap + swapon if more swap space is needed after tuning the learner.
After another round of testing, one interesting observation appeared.
While a .set_params( n_jobs = -1 ).fit( X, y ) configuration worked successfully for training the RandomForestRegressor(), the ugly surprise came later, when calling .predict( X_observed ) on the pre-trained object: a similar map/reduce-bound memory issue was reported there (with 0.17.0 now).
Nevertheless, the same object served .predict() well as a solo job via .set_params( n_jobs = 1 ).predict( X_observed ).
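The workaround above can be sketched as follows; the synthetic dataset and the forest sizes are stand-ins, not the original setup:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=20, random_state=0)

model = RandomForestRegressor(n_estimators=20, random_state=0)
model.set_params(n_jobs=-1).fit(X, y)  # parallel training works fine

model.set_params(n_jobs=1)             # drop to a solo job for inference
y_hat = model.predict(X)
print(y_hat.shape)
```

Since n_jobs is an ordinary estimator parameter, it can be changed after fitting without retraining, which is what makes this train-parallel / predict-serial split possible.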
Upvotes: 2
Reputation: 363627
Either set n_jobs=1
or upgrade to the bleeding-edge version of scikit-learn. The problem is that the currently released version uses multiple processes to fit trees in parallel, which means the data (X
and y
) must be copied into each of those processes. The next release will use threads instead of processes, so the tree learners share memory.
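A minimal sketch of the first suggestion, again with the small built-in digits dataset standing in for MNIST:

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier

X, y = load_digits(return_X_y=True)  # stand-in for the 60000x784 MNIST array

# n_jobs=1 fits all trees in the parent process, so X and y are never
# copied into worker processes.
clf = RandomForestClassifier(n_estimators=10, n_jobs=1, random_state=0)
clf.fit(X, y)
print(clf.score(X, y))
```

The trade-off is that training runs on a single core, so it will be slower but stays within one process's memory.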
Upvotes: 4