beginner_

Reputation: 7622

Pipeline and GridSearch - pipeline fully recomputed?

I'm doing a grid search with a pipeline. Part of the pipeline is feature selection, which I do inside the pipeline so that it is fitted on each specific CV fold rather than on the full data.

Pipeline:

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import VarianceThreshold
from sklearn.pipeline import Pipeline

clf = Pipeline([
    ('low_variance', VarianceThreshold(threshold=0)),
    ('feature_importance',  # SelectMaxFeaturesFromModel is my own selector class
        SelectMaxFeaturesFromModel(RandomForestClassifier(), threshold='0.75*median')),
    ('classification', xgb)  # xgb: an XGBoost classifier defined earlier
])

This pipeline is then used in the grid search.

My question is about how this is handled internally. Is the full pipeline simply rerun for every CV fold in every search iteration? I ask because the feature-selection output is effectively constant: each CV fold produces the same selected features in every iteration, so with k-fold CV there are exactly k distinct outputs. (Given the randomness of the random forest the output might not be literally constant, but constant, i.e. the same features per fold, is what one wants.)

So instead of running the selection as many times as there are search iterations, one precomputed run per fold would suffice. Does such a feature exist, or do I need to create my own selector? And how would such a selector know which CV fold is currently running?
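To make the question concrete, here is a small sketch (names `FitCounter` and the toy data are mine, not from the original post) that counts how often a pass-through transformer step is refit during a grid search. With 3 candidate parameter values and 5 folds, the transformer is fitted 15 times, plus once more for the final refit on the full data:

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

class FitCounter(BaseEstimator, TransformerMixin):
    """Pass-through transformer that records how often it is refit."""
    n_fits = 0  # class-level attribute, so clones made by GridSearchCV share the counter

    def fit(self, X, y=None):
        FitCounter.n_fits += 1
        return self

    def transform(self, X):
        return X

X, y = make_classification(n_samples=100, random_state=0)
pipe = Pipeline([('counter', FitCounter()), ('clf', LogisticRegression())])
GridSearchCV(pipe, {'clf__C': [0.1, 1, 10]}, cv=5).fit(X, y)

# 3 candidates x 5 folds = 15 fits, plus 1 final refit on the full data
print(FitCounter.n_fits)  # -> 16
```

This shows the recomputation the question is about: the "counter" step is refit for every (candidate, fold) pair even though its output never changes.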

UPDATE:

Maybe this is just RTFM?

The documentation doesn't clearly state that one instance is cached per CV fold, but I assume that is the case?

Upvotes: 1

Views: 376

Answers (1)

Ben Reiniger

Reputation: 12582

To the original question: yes, every pipeline step is recomputed for every combination of hyperparameters and fold.

To the update, yes, you can cache the pipeline steps to prevent this (though then you have the cost of writing/reading from files, so this should only be done for expensive transformers). A better description of how that works is in the User Guide:

[the cache is used instead of refitting] if the parameters and input data are identical

So yes, you will get a separate cache entry for each CV fold, but not for each hyperparameter combination, provided you aren't searching over the transformers' hyperparameters.
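A minimal sketch of the caching setup, using sklearn's built-in `SelectFromModel` as a stand-in for the custom `SelectMaxFeaturesFromModel` in the question, and a second `RandomForestClassifier` standing in for the XGBoost classifier; the toy data and parameter grid are illustrative:

```python
from shutil import rmtree
from tempfile import mkdtemp

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel, VarianceThreshold
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

cachedir = mkdtemp()  # fitted transformers are memoized to disk here
clf = Pipeline([
    ('low_variance', VarianceThreshold(threshold=0)),
    ('feature_importance',
        SelectFromModel(RandomForestClassifier(random_state=0),
                        threshold='0.75*median')),
    ('classification', RandomForestClassifier(random_state=0)),
], memory=cachedir)

# The search only varies classifier hyperparameters, so both transformer
# steps see identical (parameters, data) per fold: each fold's transformer
# fit is computed once and then read back from the cache for the other
# candidates, instead of being refit from scratch.
search = GridSearchCV(clf, {'classification__n_estimators': [50, 100]}, cv=3)
search.fit(X, y)

rmtree(cachedir)  # clean up the cache directory when done
```

Note the file I/O cost mentioned above: for cheap transformers like `VarianceThreshold` the cache read can cost as much as refitting, so caching pays off mainly for the expensive random-forest-based selection step.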

Upvotes: 0
