GridSearchCV : n_jobs in parallel (internals)

Question

How does GridSearchCV with n_jobs being set to a >1 value actually work. Does it create multiple instances of the classifier for each node(computation node) or does it create 1 single classifier which is shared by all the nodes. The reason I am asking is becuase I am using vowpal_wabbits Python wrapper: https://github.com/josephreisinger/vowpal_porpoise/blob/master/vowpal_porpoise/vw.py and see that it opens a subprocess (with stdin, stdout, stderr etc). However when I use this from GridSearch with n_jobs > 1 , I get a broken pipe error after some time and am trying to understand why?

ogrisel · Accepted Answer

n_jobs > 1 will make GridSearchCV use Python's multiprocessing module under the hood. That means that the original estimator instance will be copied (pickled) to be send over to the worker Python processes. All scikit-learn models MUST be picklable. If the vowpal_porpoise opens pipes to a vw subprocess in the constructor object, it has to close them and reopen them around the pickling / unpickling steps by defining custom __getstate__ and __setstate__ methods. Have a look at the Python documentation for more details.

The subprocess should probably be close and reopened upon the call to the set_params method to update the parameters of the model with new parameter values.

It would be easier to not open the subprocess in the constructor and just open it on demand in the fit and predict methods and close the subprocess each time.

GridSearchCV : n_jobs in parallel (internals)

Answers (2)

Related Questions