Reputation: 1437
I wonder if we can set up an "optional" step in sklearn.pipeline. For example, for a classification problem, I may want to try an ExtraTreesClassifier with and without a PCA transformation ahead of it. In practice, it might be a pipeline with an extra parameter that toggles the PCA step, so that I can optimize over it via GridSearch and the like. I don't see such an implementation in the sklearn source, but is there any workaround?
Furthermore, since the possible parameter values of a later step in the pipeline might depend on the parameters of a previous step (e.g., valid values of ExtraTreesClassifier.max_features depend on PCA.n_components), is it possible to specify such a conditional dependency in sklearn.pipeline and sklearn.grid_search?
Thank you!
Upvotes: 26
Views: 7651
Reputation: 367
From the docs:
Individual steps may also be replaced as parameters, and non-final steps may be ignored by setting them to None:
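from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
pipe = Pipeline([("reduce_dim", PCA()), ("clf", SVC())])
params = dict(reduce_dim=[None, PCA(5), PCA(10)],
              clf=[SVC(), LogisticRegression()],
              clf__C=[0.1, 10, 100])
grid_search = GridSearchCV(pipe, param_grid=params)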
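Applied to the setup in the question, this could look roughly like the sketch below. It assumes a reasonably recent scikit-learn, where the string "passthrough" is accepted as the way to skip a step (the None shown in the quoted docs is the older spelling). The iris data, the step names and the candidate n_components values are only illustrative choices, not something from the question.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy data (4 features), used only to make the sketch runnable.
X, y = load_iris(return_X_y=True)

pipe = Pipeline([
    ("reduce_dim", PCA()),
    ("clf", ExtraTreesClassifier(n_estimators=50, random_state=0)),
])

# The whole "reduce_dim" step is treated as a hyperparameter:
# "passthrough" skips it, the PCA instances replace it.
param_grid = {
    "reduce_dim": ["passthrough", PCA(n_components=2), PCA(n_components=3)],
}

search = GridSearchCV(pipe, param_grid=param_grid, cv=3)
search.fit(X, y)

# Shows which variant (with or without PCA) scored best.
print(search.best_params_)
```

Each candidate is cross-validated as a full pipeline, so the "no PCA" variant competes on equal terms with the PCA variants.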
Upvotes: 20
Reputation: 40159
Pipeline steps cannot currently be made optional in a grid search, but as a quick workaround you could wrap the PCA class in your own OptionalPCA component with a boolean parameter that turns PCA off when requested. You might want to have a look at hyperopt to set up more complex search spaces. I think it has good sklearn integration to support this kind of pattern by default, but I cannot find the docs anymore. Maybe have a look at this talk.
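A minimal sketch of such a wrapper might look like this. OptionalPCA is not part of scikit-learn; the class, its enabled flag and the choice to forward only n_components are assumptions made for illustration, and the iris data is just there to make it runnable.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


class OptionalPCA(TransformerMixin, BaseEstimator):
    """Hypothetical wrapper: applies PCA only when `enabled` is True."""

    def __init__(self, enabled=True, n_components=None):
        self.enabled = enabled
        self.n_components = n_components

    def fit(self, X, y=None):
        # Fit the inner PCA only if the step is switched on.
        if self.enabled:
            self.pca_ = PCA(n_components=self.n_components).fit(X)
        return self

    def transform(self, X):
        # Return the input unchanged when the step is switched off.
        if self.enabled:
            return self.pca_.transform(X)
        return X


X, y = load_iris(return_X_y=True)

pipe = Pipeline([
    ("pca", OptionalPCA(n_components=2)),
    ("clf", ExtraTreesClassifier(n_estimators=50, random_state=0)),
])

# The boolean toggle becomes an ordinary grid-search parameter.
search = GridSearchCV(pipe, param_grid={"pca__enabled": [True, False]}, cv=3)
search.fit(X, y)
print(search.best_params_)
```

Because the flag is a plain constructor argument, GridSearchCV can clone the step and set pca__enabled like any other hyperparameter.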
For the dependent parameters problem, GridSearchCV supports trees of parameters to handle this case, as demonstrated in the documentation.
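As I understand it, "trees of parameters" refers to passing param_grid as a list of dicts, where each dict is expanded separately. A rough sketch under that assumption follows, with made-up candidate values on the 4-feature iris data and "passthrough" used to skip PCA (assuming a recent scikit-learn). Each branch only lists max_features values that are valid for the number of columns PCA leaves behind.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV, ParameterGrid
from sklearn.pipeline import Pipeline

X, y = load_iris(return_X_y=True)  # 4 input features

pipe = Pipeline([
    ("pca", PCA()),
    ("clf", ExtraTreesClassifier(n_estimators=20, random_state=0)),
])

# One dict per branch; max_features never exceeds the number of
# features that reach the classifier in that branch.
param_grid = [
    {"pca": ["passthrough"], "clf__max_features": [1, 2, 4]},
    {"pca__n_components": [2], "clf__max_features": [1, 2]},
    {"pca__n_components": [3], "clf__max_features": [1, 2, 3]},
]

# ParameterGrid lists the combinations GridSearchCV will try.
n_candidates = len(ParameterGrid(param_grid))
print(n_candidates)

search = GridSearchCV(pipe, param_grid=param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```

The dependency is expressed by enumeration rather than by a rule, so this stays manageable only while the number of branches is small.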
Upvotes: 18