csankar69

Reputation: 677

GridSearchCV on a pipeline with StandardScaler, PCA & Lasso

Suppose I run GridSearchCV on a pipeline of [StandardScaler, PCA, Lasso], where the grid search covers 2 values for a PCA parameter and 3 values for a Lasso parameter (thus 6 possible parameter combinations). During CV, for a given fold, does the algorithm fit the standardizer only on that fold's train set (i.e., exclude the fold's test set when determining the mean/variance), or does it standardize the entire data set once, outside the folds (in which case standardization happens only once for the whole grid search procedure)?

Upvotes: 1

Views: 2379

Answers (1)

eickenberg

Reputation: 14377

If you build a sklearn.pipeline.Pipeline containing a sklearn.preprocessing.StandardScaler, a sklearn.decomposition.PCA and a sklearn.linear_model.Lasso, and wrap this pipeline in GridSearchCV to get a cross-validated estimator, then the StandardScaler estimates its parameters for centering and rescaling to unit variance only on the train portion of each internal fold.
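A minimal sketch of the setup described in the question may help here. The synthetic data, the step names (`scaler`, `pca`, `lasso`) and the tuned parameters (`n_components` for PCA, `alpha` for Lasso) are assumptions for illustration; the question does not say which parameters are searched.

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (assumption: stands in for the real data set).
X, y = make_regression(n_samples=120, n_features=10, noise=5.0, random_state=0)

# All three steps live inside the pipeline, so each is refit on every train fold.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA()),
    ("lasso", Lasso()),
])

# 2 PCA values x 3 Lasso values = 6 candidates (parameter choice is hypothetical).
param_grid = {
    "pca__n_components": [3, 5],
    "lasso__alpha": [0.01, 0.1, 1.0],
}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)

n_candidates = len(search.cv_results_["params"])
print(n_candidates, search.best_params_)
```

Because the scaler is a step of the pipeline rather than a preprocessing pass applied to `X` beforehand, GridSearchCV clones and refits the whole pipeline, scaler included, for every candidate and every fold. Standardizing `X` once before calling `fit` would instead let test-fold statistics leak into the scaling.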

When the pipeline is evaluated on the test fold, the StandardScaler reuses the stored means and standard deviations: it subtracts the train mean from the test set and divides the result by the train standard deviation.
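What happens inside a single fold can be reproduced by hand. The following is only an illustrative sketch, assuming random data and a plain `KFold` split; it checks that transforming the test fold amounts to applying the train fold's mean and standard deviation.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

# Random data (assumption: any numeric feature matrix would do).
rng = np.random.default_rng(0)
X = rng.normal(loc=3.0, scale=2.0, size=(20, 4))

# Take the first train/test split of a 5-fold CV.
train_idx, test_idx = next(KFold(n_splits=5).split(X))
X_train, X_test = X[train_idx], X[test_idx]

# Fit on the train fold only, then transform the test fold.
scaler = StandardScaler().fit(X_train)
X_test_scaled = scaler.transform(X_test)

# Manual version: train mean and train (population) standard deviation.
manual = (X_test - X_train.mean(axis=0)) / X_train.std(axis=0)
matches_train_stats = np.allclose(X_test_scaled, manual)
print(matches_train_stats)
```

Note that the scaled test fold will generally not have exactly zero mean and unit variance, since the statistics come from the train fold.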

So the answer is the former: the StandardScaler is fit on the train fold only and does not use the test fold in any way to determine the mean and variance of the data.

Upvotes: 5
