impiyush

Reputation: 794

Scikit Learn RandomForest Memory Error

I am trying to run scikit-learn's random forest algorithm on the MNIST handwritten digits dataset. During training, the system runs into a MemoryError. Please tell me how I can fix this issue.

System specs: Intel Core 2 Duo with 4 GB RAM

The shape of the dataset is (60000, 784). The complete error, as shown on the Linux terminal, is as follows:

> File "./reducer.py", line 53, in <module>
>     main()
> File "./reducer.py", line 38, in main
>     clf = clf.fit(data,labels) #training the algorithm
> File "/usr/lib/pymodules/python2.7/sklearn/ensemble/forest.py", line 202, in fit
>     for i in xrange(n_jobs))
> File "/usr/lib/pymodules/python2.7/joblib/parallel.py", line 409, in __call__
>     self.dispatch(function, args, kwargs)
> File "/usr/lib/pymodules/python2.7/joblib/parallel.py", line 295, in dispatch
>     job = ImmediateApply(func, args, kwargs)
> File "/usr/lib/pymodules/python2.7/joblib/parallel.py", line 101, in __init__
>     self.results = func(*args, **kwargs)
> File "/usr/lib/pymodules/python2.7/sklearn/ensemble/forest.py", line 73, in _parallel_build_trees
>     sample_mask=sample_mask, X_argsorted=X_argsorted)
> File "/usr/lib/pymodules/python2.7/sklearn/tree/tree.py", line 476, in fit
>     X_argsorted=X_argsorted)
> File "/usr/lib/pymodules/python2.7/sklearn/tree/tree.py", line 357, in _build_tree
>     np.argsort(X.T, axis=1).astype(np.int32).T)
> File "/usr/lib/python2.7/dist-packages/numpy/core/fromnumeric.py", line 680, in argsort
>     return argsort(axis, kind, order)
> MemoryError

Upvotes: 5

Views: 10331

Answers (4)

pplonski

Reputation: 5849

Please train the Random Forest with a single tree and check the tree depth. By default, scikit-learn grows full trees. The depth can be very large, so even a few trees in the forest can use a lot of memory. You can try to limit the depth of the trees with the max_depth hyper-parameter.

I ran an experiment in which I reduced the depth of the trees from 42 (the average depth of a tree in the forest) to 6. Memory usage decreased 66-fold while performance was slightly better (by about 4%).
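A minimal sketch of this approach (my own illustration, not code from the answer), assuming the question's data and labels arrays; the tree counts and max_depth=6 are only example values:

    from sklearn.ensemble import RandomForestClassifier

    # Fit a single fully grown tree first to see how deep it gets.
    probe = RandomForestClassifier(n_estimators=1)
    probe.fit(data, labels)
    print(probe.estimators_[0].tree_.max_depth)  # depth of the unconstrained tree

    # Then cap the depth to cut memory use.
    clf = RandomForestClassifier(n_estimators=10, max_depth=6)
    clf.fit(data, labels)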

Upvotes: 0

Sanchit

Reputation: 3289

One solution is to use the most recent version (0.19) of scikit-learn. The bug-fixes section of the changelog mentions this (and indeed, it is a major improvement):

> Fixed excessive memory usage in prediction for random forests estimators. #8672 by Mike Benfield.

You can install this version by using:

pip3 install scikit-learn==0.19.0

Upvotes: 1

user3666197

Reputation: 1

The scikit-learn dev team has greatly improved both memory management & performance of the .ensemble methods

With all due respect to other opinions, scikit-learn 0.16.1 does not prove to have the "nasty" X, y replicas cited for some early versions.

For some other reasons, I have spent rather a long time on the RandomForestRegressor() hyperparameter landscape, incl. its memory-footprint problems.

As of 0.16.1, there was less than a 2% increase in the parallel joblib memory requirements when moving from the default n_jobs = 1 to { 2, 3, ... }.

A co-father of recent scikit-learn releases, @glouppe, posted a marvelous & insightful presentation (2014-Aug, rel. 0.15.0), incl. comparisons with R-based and other well-known RandomForest frameworks.

IMHO, pages 25+ speak about techniques that increase speed, incl. np.asfortranarray(...); however, these seem to me (without any experimental proof) to be internal directions shared inside the scikit-learn development team rather than a recommendation for us, the mortals who live in the "outer world".

Regression or Classification?

Yes, that matters. Some additional feature-engineering efforts & testing might be in order if you are not doing a full-scale FeatureSET vector bagging. Your learner seems to be the Classifier case, so go deeper into:

  1. experimenting with non-default settings for max_features et al. (see the sketch after this list)
  2. using O/S services to handle larger virtual memory (mkswap + swapon) if still needed after tuning the learner in step 1.
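A minimal sketch of step 1 (my own illustration, not code from the answer); the max_features values and the 10-tree forest are hypothetical examples, and data/labels are the arrays from the question:

    from sklearn.ensemble import RandomForestClassifier

    # Compare a few non-default max_features settings on the same data;
    # smaller values mean fewer candidate features are evaluated at each split.
    for max_features in ("sqrt", "log2", 0.05):
        clf = RandomForestClassifier(n_estimators=10, max_features=max_features)
        clf.fit(data, labels)
        print(max_features, clf.score(data, labels))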

Addendum

After another round of testing, one interesting observation appeared.

While a .set_params( n_jobs = -1 ).fit( X, y ) configuration was used successfully for training the RandomForestRegressor(), the ugly surprise came later, when trying to use .predict( X_observed ) on such a pre-trained object.

There, a similar map/reduce-bound memory issue was reported (now with 0.17.0).

Nevertheless, the same object with .set_params( n_jobs = 1 ) was served well as a solo job on .predict( X_observed ).
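A minimal sketch of that workaround (my own illustration; the regressor settings and the X, y, X_observed names are placeholders):

    from sklearn.ensemble import RandomForestRegressor

    # Train with all cores, but drop back to a single job for prediction
    # to avoid the joblib memory blow-up described above.
    model = RandomForestRegressor(n_estimators=100, n_jobs=-1)
    model.fit(X, y)

    model.set_params(n_jobs=1)
    y_pred = model.predict(X_observed)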

Upvotes: 2

Fred Foo

Reputation: 363627

Either set n_jobs=1 or upgrade to the bleeding edge version of scikit-learn. The problem is that the currently released version uses multiple processes to fit trees in parallel, which means that the data (X and y) need to be copied to these processes. The next release will use threads instead of processes, so the tree learners share memory.
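A minimal sketch of the n_jobs=1 option (my own illustration, using the question's data and labels arrays and an arbitrary number of trees):

    from sklearn.ensemble import RandomForestClassifier

    # A single process fits all trees, so X and y are not copied to worker
    # processes; those copies are what trigger the MemoryError here.
    clf = RandomForestClassifier(n_estimators=10, n_jobs=1)
    clf = clf.fit(data, labels)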

Upvotes: 4
