How to impute the missing value on the test set?

Question

I am handling missing data, and both my train and test sets have missing values. I am a little confused about how to deal with the missing values in the test set. If I impute using the "mean" method, should I use the mean calculated from the train set or from the test set to fill in the missing values in the test set? Thank you for helping me!

Matus Dubrava · Accepted Answer

In general, you should not compute the mean, or any other statistic, from the test set. The best way to think about the test set is that it simply doesn't exist, at least until you have already trained your model.
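To make this concrete, here is a minimal sketch in plain Python, assuming a single numeric feature with missing entries encoded as NaN. The helper names `fit_mean` and `impute` and the toy values are made up for illustration; the point is only that the fill value is computed once from the train set and reused for the test set.

```python
import math
from statistics import mean


def fit_mean(values):
    """Return the mean of the observed (non-NaN) values."""
    observed = [v for v in values if not math.isnan(v)]
    return mean(observed)


def impute(values, fill_value):
    """Return a copy of values with every NaN replaced by fill_value."""
    return [fill_value if math.isnan(v) else v for v in values]


nan = float("nan")
train = [1.0, 2.0, nan, 3.0]  # toy data, hypothetical
test = [nan, 10.0]            # toy data, hypothetical

# The statistic is learned from the train set only.
train_mean = fit_mean(train)

# The same train-set mean fills the gaps in both sets.
train_filled = impute(train, train_mean)
test_filled = impute(test, train_mean)
```

Note that the observed test value of 10.0 has no influence on the fill value, which is exactly what treating the test set as nonexistent during training implies.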

Build a transformation pipeline that handles all the necessary preprocessing steps (imputing missing data, standardizing, feature engineering, dimensionality reduction, and so on) and fit it on the training set. When a new observation arrives, apply the fitted pipeline's transformations to it. The test set should be treated as just a set of new observations that were unavailable during training.
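One way such a pipeline might look is sketched below with scikit-learn, which the answer does not name, so treat the library choice as an assumption. The toy arrays and the step names "impute" and "scale" are also made up for illustration. `fit_transform` is called on the train set only, and `transform` reuses the learned statistics on the test set.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical): two numeric features with some missing entries.
X_train = np.array([[1.0, 10.0],
                    [2.0, np.nan],
                    [3.0, 30.0],
                    [np.nan, 20.0]])
X_test = np.array([[np.nan, 40.0],
                   [5.0, np.nan]])

# Mean imputation followed by standardization.
preprocess = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])

# Fit on the train set only: column means and scaling parameters are learned here.
X_train_prepared = preprocess.fit_transform(X_train)

# Apply the already fitted steps to the test set; nothing is re-estimated.
X_test_prepared = preprocess.transform(X_test)

# Per-column means learned from the train set by the imputer.
learned_means = preprocess.named_steps["impute"].statistics_
```

A model can be appended as the final step of the same `Pipeline`, so that a single `fit` on the train set and a single `predict` on new data keep the preprocessing consistent.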
