user1769197

Reputation: 2213

Sklearn Pipeline: is there leakage /bias when including scaling in the pipeline?

In machine learning, you split the data into training data and test data.

In cross-validation, you further split the training data into training folds and a validation fold.

"And if scaling is required, at each iteration of the CV, the means and standard deviations of the training folds (not the entire training data), excluding the validation fold, are computed and used to scale the validation fold, so that the scaling step never includes information from the validation fold."
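To make the quoted procedure concrete, here is a minimal sketch in plain Python. The helper name `scale_fold`, the toy data, and the choice of 3 folds are my own assumptions for illustration, not taken from any library.

```python
from statistics import mean, pstdev


def scale_fold(train_values, val_values):
    """Standardize both folds using statistics from the training fold only.

    Hypothetical helper for illustration; handles a single feature.
    """
    mu = mean(train_values)
    sigma = pstdev(train_values)  # population standard deviation
    if sigma == 0:
        sigma = 1.0  # avoid division by zero for a constant feature

    def scale(values):
        return [(x - mu) / sigma for x in values]

    return scale(train_values), scale(val_values)


# Toy single-feature data, split into k contiguous folds (assumed setup).
data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
k = 3
fold_size = len(data) // k

for i in range(k):
    val = data[i * fold_size:(i + 1) * fold_size]
    train = data[:i * fold_size] + data[(i + 1) * fold_size:]
    # The validation fold never contributes to mu or sigma.
    train_scaled, val_scaled = scale_fold(train, val)
    print(f"fold {i}: validation scaled with training-fold stats -> {val_scaled}")
```

The point of the sketch is only that `mu` and `sigma` are recomputed in every iteration from the training folds, and the validation fold is transformed with those values.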

My question is: when I include scaling in the pipeline, at each CV iteration, are the scaling parameters computed from the smaller training folds (excluding the validation fold) or from the entire training data (including the validation fold)? If the means and standard deviations are computed from the entire training data, this will bias the performance estimate on the validation fold.

Upvotes: 0

Views: 163

Answers (1)

gtancev

Reputation: 253

I thought about this, too, and although I think that scaling with the full training data leaks some information from the validation fold into training, I don't think it's that severe.

On the one hand, you shuffle the data anyway, and you assume that the distributions in all sets are the same, so you expect the means and standard deviations to be the same. (Of course, this holds only in theory, by the law of large numbers.)

On the other hand, even if the means and standard deviations are different, this difference will not be significant.

In my opinion, yes, you might have some bias, but it should be negligible.
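If you want to check which statistics the scaler ends up with when it sits inside a Pipeline, something like the following sketch should work. The synthetic data, the `LogisticRegression` classifier, and the 5 unshuffled folds are assumptions chosen only to keep the example reproducible. It relies on `cross_validate` cloning the pipeline and fitting it on the training part of each split, and on `return_estimator=True` returning those fitted pipelines.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, assumed purely for illustration.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

pipe = Pipeline([("scaler", StandardScaler()), ("clf", LogisticRegression())])
cv = KFold(n_splits=5)  # no shuffling, so cv.split(X) reproduces the same folds

# return_estimator=True keeps the pipeline fitted in each CV iteration.
result = cross_validate(pipe, X, y, cv=cv, return_estimator=True)

matches_fold_mean = []
matches_full_mean = []
for fitted, (train_idx, _) in zip(result["estimator"], cv.split(X)):
    scaler_mean = fitted.named_steps["scaler"].mean_
    # Compare the learned mean with the training-fold mean and the full-data mean.
    matches_fold_mean.append(np.allclose(scaler_mean, X[train_idx].mean(axis=0)))
    matches_full_mean.append(np.allclose(scaler_mean, X.mean(axis=0)))

print("scaler mean equals training-fold mean:", matches_fold_mean)
print("scaler mean equals full-data mean:   ", matches_full_mean)
```

In this setup the scaler's `mean_` in every iteration matches the mean of that iteration's training folds, so the scaler inside the pipeline is refit per fold rather than on the entire training data. Scaling the whole training set before calling the CV routine is the case where the validation fold would contribute to the statistics.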

Upvotes: 1
