Reputation: 1017
I noticed that predict() will only create predictions on complete cases. I had included medianImpute in the preProcess options, as follows:
library(caret)

# train_ctrl is a trainControl() object defined elsewhere
train(outcome ~ .,
      data = df,
      method = "rf",
      tuneLength = 5,
      preProcess = c("YeoJohnson", "center", "scale", "medianImpute"),
      metric = "ROC",
      trControl = train_ctrl)
Does this mean that I should impute the missing values before training the model? If I don't, I am unable to create predictions for all cases in the test set. I had read in Dr. Kuhn's book that pre-processing should occur during cross-validation... Thanks!
Upvotes: 0
Views: 2854
Reputation: 37889
If you are using medianImpute, then the imputation definitely needs to happen before the training set is created; otherwise, even if you also impute the test set with medianImpute, the results would be wrong.
Take the following extreme case as an example:
You have only one independent variable X, which consists of the numbers 1 to 100. Imagine the extreme case of splitting the data set into a 50% test set and a 50% training set, with the numbers 1 to 50 in the test set and the numbers 51 to 100 in the training set.
> median(1:50) #test set median
[1] 25.5
> median(51:100) #training set median
[1] 75.5
Using your code (caret's train function), the missing values in the training set would be replaced with 75.5. This creates several problems. The clearest one is that the same method (medianImpute) cannot simply be applied to the test set, because missing values in the test set would be replaced with 25.5, a very different value from the one used in training.
Therefore, the best thing to do is to account for the missing data before the training set's creation.
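To make the extreme case concrete, here is a minimal base-R sketch. It does not call caret at all; the variable names are made up for illustration, and the single NA appended to each half is an assumption added so that there is something to impute. It only shows how far apart the per-set medians are from the median of the full data.

```r
# Extreme split from the example: 1..50 is the test set, 51..100 the
# training set. One NA is appended to each half (illustrative assumption).
test_x  <- c(1:50, NA)
train_x <- c(51:100, NA)

# Medians computed separately within each set, ignoring the NA.
test_fill  <- median(test_x, na.rm = TRUE)
train_fill <- median(train_x, na.rm = TRUE)

# Median of the full data, computed before any split.
full_fill <- median(c(test_x, train_x), na.rm = TRUE)

# Fill each set's missing value with that set's own median.
test_imputed  <- ifelse(is.na(test_x), test_fill, test_x)
train_imputed <- ifelse(is.na(train_x), train_fill, train_x)

cat("test-set fill value:     ", test_fill, "\n")
cat("training-set fill value: ", train_fill, "\n")
cat("full-data fill value:    ", full_fill, "\n")
```

The two per-set fill values differ by 50, while the full-data median sits between them, which is the gap the argument above relies on.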
Hope this helps!
Upvotes: 5