Reputation: 5415
I'm building a predictive model and am using the mice package to impute NAs in my training set. Since I need to reuse the same imputation scheme for my test set, how can I re-apply it to my test data?
# generate example data
set.seed(333)
mydata <- data.frame(a = as.logical(rbinom(100, 1, 0.5)),
                     b = as.logical(rbinom(100, 1, 0.2)),
                     c = as.logical(rbinom(100, 1, 0.8)),
                     y = as.logical(rbinom(100, 1, 0.6)))
na_a <- as.logical(rbinom(100, 1, 0.3))
na_b <- as.logical(rbinom(100, 1, 0.3))
na_c <- as.logical(rbinom(100, 1, 0.3))
mydata$a[na_a] <- NA
mydata$b[na_b] <- NA
mydata$c[na_c] <- NA
# create train/test sets
library(caret)
inTrain <- createDataPartition(mydata$y, p = .8, list = FALSE)
train <- mydata[ inTrain, ]
test <- mydata[-inTrain, ]
# impute NAs in train set
library(mice)
imp <- mice(train, method = "logreg")
train_imp <- complete(imp)
# apply imputation scheme to test set
test_imp <- unknown_function(test, imp$unknown_data)
Upvotes: 9
Views: 3309
Reputation: 41
As of version 3.12.0, mice::mice has an ignore parameter, which covers most use cases.
Pass it a logical vector with one element per row: FALSE for rows that should be used to fit the imputation model, and TRUE for rows that should be ignored when fitting. Ignored rows are still imputed; they just do not influence the model parameters.
imp.ignore <- mice(data, ignore = c(rep(FALSE, 99), TRUE), maxit = 5, m = 2, seed = 1)
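Applied to a train/test split, the idea could look roughly like the sketch below. The toy data, the is_test flag, and the 80/20 split are assumptions made for illustration and are not taken from the question. Note also that the outcome y is part of the imputation model here, which may not be possible if the outcome is unavailable for new data.

```r
library(mice)

set.seed(333)
n <- 100

# Hypothetical numeric toy data: two predictors and an outcome
dat <- data.frame(a = rnorm(n), b = rnorm(n))
dat$y <- dat$a + dat$b + rnorm(n)

# Introduce missing values at fixed positions in both predictors
dat$a[seq(5, n, by = 5)] <- NA
dat$b[seq(3, n, by = 7)] <- NA

# Assumed split: the last 20 rows play the role of the test set
is_test <- seq_len(n) > 80

# Rows with ignore = TRUE do not influence the imputation model,
# but their missing values are still imputed
imp <- mice(dat, ignore = is_test, m = 1, maxit = 5, seed = 1,
            printFlag = FALSE)

full <- complete(imp)
train_imp <- full[!is_test, ]
test_imp  <- full[is_test, ]
```

With m greater than 1, complete(imp, action = i) would return the i-th completed data set, and the same row filter can be applied to each one.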
Upvotes: 4
Reputation: 2323
prockenschaub has created a lovely function for that, called mice.reuse():
library(mice)
library(scorecard)
# function to impute new observations based on the previous imputation model
source("https://raw.githubusercontent.com/prockenschaub/Misc/master/R/mice.reuse/mice.reuse.R")
# split data into train and test
data_list <- split_df(airquality, y = NULL, ratio = 0.75, seed = 186)
imp <- mice(data = data_list$train,
            seed = 500,
            m = 5,
            method = "pmm",
            printFlag = FALSE)  # full argument name; "print" only works via partial matching
# impute test data based on train imputation model
test_imp <- mice.reuse(imp, data_list$test, maxit = 1)
Upvotes: 3
Reputation: 11
When you are training a model, you must not use the test data in any way. You therefore cannot run MICE on the complete dataset before splitting. The imputation of the test data must also be based on the training data only.
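To make the principle concrete, here is a minimal base R sketch. It uses per-column median imputation as a deliberately simple stand-in for mice, and the data and the helper impute_with() are hypothetical. The point is only that whatever the imputation learns is learned from the training rows and then applied unchanged to the test rows.

```r
# Hypothetical toy data with a few missing values
set.seed(1)
dat <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
dat$x1[c(3, 10, 45)] <- NA
dat$x2[c(7, 48)] <- NA

# Split first, so the test rows never influence the imputation
train <- dat[1:40, ]
test  <- dat[41:50, ]

# "Fit" the imputation on the training rows only
train_medians <- vapply(train, median, numeric(1), na.rm = TRUE)

# Hypothetical helper: fill NAs using previously stored per-column values
impute_with <- function(df, values) {
  for (col in names(values)) {
    df[[col]][is.na(df[[col]])] <- values[[col]]
  }
  df
}

train_imp <- impute_with(train, train_medians)
test_imp  <- impute_with(test, train_medians)
```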
Upvotes: 0
Reputation:
Run the mice imputation on the combined dataset and only then split it into train and test sets. Fit the machine learning classifier on the train set and then apply it to the test set.
Upvotes: -5