Alessandro

Reputation: 865

Parallel error with GridSearchCV, works fine with other methods

I am encountering the following problem using GridSearchCV: it raises a parallel error when n_jobs > 1. At the same time, n_jobs > 1 works fine with single models such as RandomForestClassifier.

Below is a minimal example reproducing the error:

import numpy as np
from sklearn import ensemble, model_selection

train = np.random.rand(100, 10)
targ = np.random.randint(0, 2, 100)

clf = ensemble.RandomForestClassifier(n_jobs=2)
clf.fit(train, targ)
Out[349]: RandomForestClassifier(bootstrap=True, class_weight=None,     criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=2, oob_score=False, random_state=None,
            verbose=0, warm_start=False)

This example works fine.

Meanwhile the following doesn't work:

clf = ensemble.RandomForestClassifier()
param_grid = {'n_estimators': [10, 20]}
grid_s = model_selection.GridSearchCV(clf, param_grid=param_grid, n_jobs=-1, verbose=1)
grid_s.fit(train, targ)

And gives the following error:

Fitting 3 folds for each of 2 candidates, totalling 6 fits

ImportErrorTraceback (most recent call last)
<ipython-input-351-b8bb45396026> in <module>()
      2 param_grid = {'n_estimators': [10,20]}
      3 grid_s= model_selection.GridSearchCV(clf, param_grid=param_grid_gb,n_jobs=-1,verbose=1)
----> 4 grid_s.fit(train, targ)

/root/anaconda3/envs/python2/lib/python2.7/site-packages/sklearn/model_selection/_search.pyc in fit(self, X, y, groups)
    943             train/test set.
    944         """
--> 945         return self._fit(X, y, groups, ParameterGrid(self.param_grid))
    946 
    947 

/root/anaconda3/envs/python2/lib/python2.7/site-packages/sklearn/model_selection/_search.pyc in _fit(self, X, y, groups, parameter_iterable)
    562                                   return_times=True, return_parameters=True,
    563                                   error_score=self.error_score)
--> 564           for parameters in parameter_iterable
    565           for train, test in cv_iter)
    566 

/root/anaconda3/envs/python2/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.pyc in __call__(self, iterable)
    726         self._aborting = False
    727         if not self._managed_backend:
--> 728             n_jobs = self._initialize_backend()
    729         else:
    730             n_jobs = self._effective_n_jobs()

/root/anaconda3/envs/python2/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.pyc in _initialize_backend(self)
    538         try:
    539             return self._backend.configure(n_jobs=self.n_jobs, parallel=self,
--> 540                                            **self._backend_args)
    541         except FallbackToBackend as e:
    542             # Recursively initialize the backend in case of requested fallback.

/root/anaconda3/envs/python2/lib/python2.7/site-packages/sklearn/externals/joblib/_parallel_backends.pyc in configure(self, n_jobs, parallel, **backend_args)
    297         if already_forked:
    298             raise ImportError(
--> 299                 '[joblib] Attempting to do parallel computing '
    300                 'without protecting your import on a system that does '
    301                 'not support forking. To use parallel-computing in a '

ImportError: [joblib] Attempting to do parallel computing without protecting your import on a system that does not support forking. To use parallel-computing in a script, you must protect your main loop using "if __name__ == '__main__'". Please see the joblib documentation on Parallel for more information

Upvotes: 5

Views: 7435

Answers (3)

Alaa M.

Reputation: 5273

What worked for me was changing the parallel backend:

from sklearn.utils import parallel_backend

with parallel_backend('multiprocessing'):  # 'multiprocessing' / 'threading'
    # GridSearchCV code...

See the joblib documentation for the acceptable backend values.

Answer taken from here.
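For reference, a complete runnable sketch of this workaround might look as follows (this uses joblib's parallel_backend directly, which is what sklearn.utils re-exports; the train/targ data mirrors the question's random arrays):

```python
import numpy as np
from joblib import parallel_backend
from sklearn import ensemble, model_selection

# Random data mirroring the question's setup
train = np.random.rand(100, 10)
targ = np.random.randint(0, 2, 100)

clf = ensemble.RandomForestClassifier()
param_grid = {'n_estimators': [10, 20]}
grid_s = model_selection.GridSearchCV(clf, param_grid=param_grid, n_jobs=-1)

# Run the search under the 'threading' backend instead of the
# default process-based one, which avoids the forking issue.
with parallel_backend('threading'):
    grid_s.fit(train, targ)

print(grid_s.best_params_)
```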

Upvotes: 0

Lylo

Reputation: 21

Maybe this is still relevant for some!

I only tried this with Anaconda on a Windows 10 machine.

I had the same problem within my environment, with the following code section:

parameters = [{'C': [1, 10, 100, 1000], 'kernel': ['linear']},
              {'C': [1, 10, 100, 1000], 'kernel': ['rbf'],
               'gamma': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]}]

grid_search = GridSearchCV(estimator=classifier, param_grid=parameters,
                           scoring='accuracy', cv=10, n_jobs=-1)
grid_search = grid_search.fit(X_train, y_train)
best_accuracy = grid_search.best_score_
best_parameters = grid_search.best_params_

I did not find much on the internet, so I thought maybe I should update the joblib package. And surprise: joblib was not installed in my specific environment at all. After I installed and updated it, everything worked perfectly, with n_jobs = -1 AND n_jobs = 2.
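To check whether joblib is actually present in the active environment before digging further (the conda command assumes an Anaconda setup like the one described above):

```shell
# Check whether joblib is importable in the current environment
python -c "import joblib; print(joblib.__version__)"

# If the import fails, install/update it into the active conda env
conda install joblib
# or, with pip:
pip install --upgrade joblib
```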

Upvotes: 2

Abhishek Thakur

Reputation: 16995

I think you are using Windows. You need to wrap the grid search in a function and then call it inside an if __name__ == '__main__' block. Joblib's n_jobs=-1 uses all available cores via worker processes, and on Windows, which does not support forking, that fails unless the entry point is protected (exactly what the ImportError message says).

Try wrapping the grid search in a function:

def somefunction():
    clf = ensemble.RandomForestClassifier()
    param_grid = {'n_estimators': [10, 20]}
    grid_s = model_selection.GridSearchCV(clf, param_grid=param_grid, n_jobs=-1, verbose=1)
    grid_s.fit(train, targ)
    return grid_s

if __name__ == '__main__':
    somefunction()

Or:

if __name__ == '__main__':
    clf = ensemble.RandomForestClassifier()
    param_grid = {'n_estimators': [10, 20]}
    grid_s = model_selection.GridSearchCV(clf, param_grid=param_grid, n_jobs=-1, verbose=1)
    grid_s.fit(train, targ)

Upvotes: 13
