Ray

Reputation: 123

How to impute missing values in the test set?

I am handling missing data in both my training and test sets, and I am a little confused about how to deal with the test set. If I impute using the "mean" method, should I fill the missing values in the test set with the mean calculated from the training set or from the test set? Thank you for helping me!

Upvotes: 8

Views: 9370

Answers (2)

Szymon Maszke

Reputation: 24701

You should use the training mean for that. You should never infer information from the test dataset, as that is an information leak.

Calculating the mean of the test dataset would give your algorithm information about that dataset (obviously) and would probably falsely improve its score on it.

In real life you would usually have no way to calculate a mean from the data you are imputing anyway (think of a single incoming example with missing values).
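As a rough illustration of the point above, here is a minimal sketch in plain Python. The helper names `fit_mean` and `impute` and the toy numbers are made up for this example; the only idea it shows is that the fill value is computed from the training data and then reused unchanged on the test data.

```python
import math
from statistics import fmean


def fit_mean(train_values):
    """Return the mean of the observed (non-NaN) training values."""
    observed = [v for v in train_values if not math.isnan(v)]
    return fmean(observed)


def impute(values, fill_value):
    """Replace every NaN in `values` with `fill_value`."""
    return [fill_value if math.isnan(v) else v for v in values]


nan = float("nan")
train = [1.0, 2.0, nan, 3.0]  # hypothetical training column
test = [nan, 10.0]            # hypothetical test column

# The statistic is learned from the training set only.
train_mean = fit_mean(train)

# The same training mean fills the gaps in both sets.
train_imputed = impute(train, train_mean)
test_imputed = impute(test, train_mean)
```

Note that the test value `10.0` never influences the fill value, so this also works when only a single test row arrives.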

Upvotes: 4

Matus Dubrava

Reputation: 14462

In general, you should not compute the mean or anything else from the test set. The best way of thinking about the test set is that it simply doesn't exist, at least until you have already trained your model.

Build a transformation pipeline that handles all the necessary preprocessing steps (imputing missing data, standardizing, feature engineering, dimensionality reduction...) and fit it on the training set. When a new observation arrives, apply the fitted pipeline's transformations to it; the test set should be treated as just new observations that were unavailable during training.
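One possible way to set this up, assuming scikit-learn (the question does not say which library is in use), is a `Pipeline` that chains `SimpleImputer` and `StandardScaler`. The arrays below are toy data invented for the sketch, and a real pipeline would likely include more steps.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical): two features, with missing entries in both sets.
X_train = np.array([[1.0, 10.0],
                    [3.0, np.nan],
                    [5.0, 30.0]])
X_test = np.array([[np.nan, 20.0]])

preprocess = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),  # learns per-column means
    ("scale", StandardScaler()),                 # learns per-column mean and std
])

# All statistics are learned from the training set only.
preprocess.fit(X_train)

# The fitted pipeline is then applied to training and test data alike.
X_train_ready = preprocess.transform(X_train)
X_test_ready = preprocess.transform(X_test)
```

Calling `fit` only on `X_train` and `transform` on everything else is what keeps test-set information out of the preprocessing.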

Upvotes: 11
