mike

Reputation: 45

Parallel random forest in scikit-learn throws an exception

My sklearn version is 0.14.1 with Python 2.7 on Debian GNU/Linux 7.1.

Calling:

from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(min_samples_split=10, n_estimators=50, n_jobs=1)

works fine,

while calling:

clf = RandomForestClassifier(min_samples_split=10, n_estimators=50, n_jobs=5)
clf.fit(train.toarray(), targets)

throws the following exception:

Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 552, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 505, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 342, in _handle_tasks
    put(task)
SystemError: NULL result without error in PyObject_Call

After the exception is thrown, all of the random forest's worker processes are blocked.

Upvotes: 0

Views: 908

Answers (1)

ogrisel

Reputation: 40159

Based on the shape info, the dataset should be ~4 GB (for single-precision floats). This exception might be caused by memory exhaustion while multiprocessing serializes the data to pass it to the worker processes.
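As a back-of-the-envelope check (the shape below is hypothetical, chosen to land in that ballpark), the footprint of the densified matrix is just rows times columns times 4 bytes per float32 value:

# Hypothetical shape; float32 uses 4 bytes per value.
n_samples, n_features = 1000000, 1000
print(n_samples * n_features * 4 / 1e9)  # ~4.0 GB, before any pickled copies

Each worker process that receives a pickled copy of this array multiplies that footprint, which is why n_jobs=5 can exhaust memory where n_jobs=1 does not.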

To limit the number of in-memory copies, you can try replacing the sklearn/externals/joblib folder with a symlink to (or a copy of) the joblib subfolder of the master branch of the joblib repo: https://github.com/joblib/joblib

The development version of joblib has been improved to use memory mapping for large input arrays. This might fix your problem.

Edit: memory-mapping support has landed in joblib 0.8+ and is included by default in scikit-learn 0.15+.
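With joblib 0.8+ (as bundled in scikit-learn 0.15+), a minimal sketch of the workaround could look like the following; the file path and data shapes are hypothetical stand-ins for your own. Dumping the array once and reloading it with mmap_mode='r' lets every worker process map the same on-disk buffer instead of each receiving its own pickled copy:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.externals import joblib  # joblib 0.8+ in scikit-learn 0.15+

# Synthetic stand-in for the densified training data.
X = np.random.rand(10000, 100).astype(np.float32)
y = np.random.randint(0, 2, size=10000)

# Dump once, then reload as a read-only memory map: the fit workers
# share the mapped pages rather than serializing the whole array.
joblib.dump(X, '/tmp/train_features.joblib')  # hypothetical path
X_mmap = joblib.load('/tmp/train_features.joblib', mmap_mode='r')

clf = RandomForestClassifier(min_samples_split=10, n_estimators=50, n_jobs=5)
clf.fit(X_mmap, y)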

Upvotes: 2
