Reputation: 677
Assume that I am doing GridSearchCV on a pipeline with [StandardScaler, PCA & Lasso], where the grid search is over 2 values for a PCA parameter and 3 values for a Lasso parameter (thus 6 possible parameter combinations). When doing CV, for a given fold does the algorithm standardize only the train set in that fold (i.e., not include the fold's test set for determining mean/variance of the standardizer) or does it standardize the entire data set outside of the folds (in which case there is only one Standardizing done for the entire grid search procedure)?
Upvotes: 1
Views: 2379
Reputation: 14377
If you build a sklearn.pipeline.Pipeline object containing a sklearn.preprocessing.StandardScaler, a sklearn.decomposition.PCA and a sklearn.linear_model.Lasso, and wrap this pipeline in GridSearchCV to get a cross-validated estimator, then the StandardScaler will estimate the parameters for centering and rescaling to unit variance only on the internal train fold.
When the pipeline is evaluated on the test fold, the StandardScaler reuses the stored means and standard deviations: it subtracts the train mean from the test set and divides the result by the train standard deviation.
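To make the train/test arithmetic concrete, here is a minimal sketch on synthetic data. The arrays X_train and X_test are made up for illustration and stand in for one fold's train and test portions. It checks that transform on the held-out rows matches the manual formula built from the train statistics only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for one fold's train and test portions (illustrative only).
rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(60, 3))
X_test = rng.normal(loc=5.0, scale=2.0, size=(30, 3))

# Fit on the train portion only; the scaler stores mean_ and scale_.
scaler = StandardScaler().fit(X_train)

# Manual version: subtract the train mean, divide by the train standard deviation.
# StandardScaler uses the population standard deviation (ddof=0), as np.std does by default.
manual = (X_test - X_train.mean(axis=0)) / X_train.std(axis=0)

print(np.allclose(scaler.transform(X_test), manual))
```

Because the statistics come from the train rows, the transformed test rows will generally not have exactly zero mean or unit variance, which is expected.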
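So the answer is the first option: the StandardScaler is fit on each fold's train portion only and will not use that fold's test set in any way to determine the mean and variance of the data. It is not fit once on the entire data set before the folds are split.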
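One way to check this yourself is to log how many rows the scaler sees each time it is fit. The sketch below is only an illustration: RecordingScaler is a hypothetical helper subclass (not part of scikit-learn), the data are synthetic, and the grid values are arbitrary. It assumes the default single-process search (n_jobs=None), since a class-level log would not be shared across worker processes.

```python
from collections import Counter

from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class RecordingScaler(StandardScaler):
    """Hypothetical helper: a StandardScaler that logs the row count of every fit."""

    fit_sizes = []  # class-level, so the log survives GridSearchCV cloning the pipeline

    def fit(self, X, y=None, sample_weight=None):
        RecordingScaler.fit_sizes.append(X.shape[0])
        return super().fit(X, y, sample_weight)


# Synthetic regression data: 90 rows, so each of 3 folds trains on 60 rows.
X, y = make_regression(n_samples=90, n_features=10, noise=1.0, random_state=0)

pipe = Pipeline([
    ("scaler", RecordingScaler()),
    ("pca", PCA()),
    ("lasso", Lasso()),
])

# 2 PCA values x 3 Lasso values = 6 parameter combinations, as in the question.
param_grid = {
    "pca__n_components": [2, 5],
    "lasso__alpha": [0.01, 0.1, 1.0],
}

search = GridSearchCV(pipe, param_grid, cv=KFold(n_splits=3))
search.fit(X, y)

print(Counter(RecordingScaler.fit_sizes))
```

With this setup the log should show 18 fits on 60 rows (6 combinations times 3 folds) and one final fit on all 90 rows. The final fit comes from the default refit=True, which retrains the best pipeline on the full data after the search. The scaler is never fit on all 90 rows during cross-validation itself.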
Upvotes: 5