Sklearn Pipeline: is there leakage/bias when including scaling in the pipeline?

Question

In machine learning, you split the data into training data and test data.

In cross-validation, you further split the training data into a training set and a validation set at each iteration.

"And if scaling is required, at each iteration of the CV, the means and standard deviations of the training set (not the entire training data), excluding the validation set, are computed and used to scale the validation set, so that the scaling part never includes information from the validation set."

My question is: when I include scaling in the pipeline, is the scaling at each CV iteration computed from the smaller training set (excluding the validation set) or from the entire training data (including the validation set)? If it computes the means and standard deviations from the entire training data, this will bias the validation estimate.
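One way to check this empirically is to ask `cross_validate` to return the pipeline fitted in each fold and compare the scaler's learned means with the means of that fold's training rows. This is only a minimal sketch: the synthetic data, the variable names, and the choice of `LogisticRegression` are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (illustrative only): 100 samples, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
y = (X[:, 0] > 5.0).astype(int)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression()),
])

# A fixed random_state makes cv.split(X) reproduce the same folds later.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
result = cross_validate(pipe, X, y, cv=cv, return_estimator=True)

fold_checks = []
for fitted_pipe, (train_idx, val_idx) in zip(result["estimator"], cv.split(X)):
    scaler = fitted_pipe.named_steps["scaler"]
    # Were the scaler's means learned from this fold's training rows only?
    matches_fold = bool(np.allclose(scaler.mean_, X[train_idx].mean(axis=0)))
    # Or from all rows, validation rows included?
    matches_full = bool(np.allclose(scaler.mean_, X.mean(axis=0)))
    fold_checks.append((int(scaler.n_samples_seen_), matches_fold, matches_full))

for n_seen, matches_fold, matches_full in fold_checks:
    print(n_seen, matches_fold, matches_full)
```

If the scaler in each fold reports seeing only the training rows of that fold, and its means match those rows rather than the full data, then the validation rows did not contribute to the scaling statistics.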

Answers (1)
