Timm S.

Reputation: 5415

R - mice - machine learning: re-use imputation scheme from train to test set

I'm building a predictive model and am using the mice package for imputing NAs in my training set. Since I need to re-use the same imputation scheme for my test set, how can I re-apply it to my test data?

# generate example data
set.seed(333)
mydata <- data.frame(a = as.logical(rbinom(100, 1, 0.5)),
                     b = as.logical(rbinom(100, 1, 0.2)),
                     c = as.logical(rbinom(100, 1, 0.8)),
                     y = as.logical(rbinom(100, 1, 0.6)))

na_a <- as.logical(rbinom(100, 1, 0.3))
na_b <- as.logical(rbinom(100, 1, 0.3))
na_c <- as.logical(rbinom(100, 1, 0.3))
mydata$a[na_a] <- NA
mydata$b[na_b] <- NA
mydata$c[na_c] <- NA

# create train/test sets
library(caret)
inTrain <- createDataPartition(mydata$y, p = .8, list = FALSE)
train <- mydata[ inTrain, ] 
test <-  mydata[-inTrain, ]

# impute NAs in train set
library(mice)
imp <- mice(train, method = "logreg")
train_imp <- complete(imp)

# apply imputation scheme to test set (placeholder: this is the function I'm looking for)
test_imp <- unknown_function(test, imp$unknown_data)

Upvotes: 9

Views: 3309

Answers (4)

arjunbazinga

Reputation: 41

As of version 3.12.0, mice::mice has an ignore parameter, which covers most use cases.

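Simply pass it a logical vector with one element per row: FALSE for all rows that should be used to fit the imputation model (the training rows) and TRUE for all rows that should only be imputed but not used for fitting (e.g. the test rows). In the call below, only the last row is ignored.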

imp.ignore <- mice(data, ignore = c(rep(FALSE, 99), TRUE), maxit = 5, m = 2, seed = 1)
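
Applied to the data from the question, this could look roughly like the sketch below. It assumes mice >= 3.12.0; the 80/20 split via sample() and the names train_idx and is_test are my own illustration, not part of the original post (the question uses caret::createDataPartition instead).

```r
library(mice)

# Example data as in the question: logical columns, NAs in a, b and c
set.seed(333)
n <- 100
mydata <- data.frame(a = as.logical(rbinom(n, 1, 0.5)),
                     b = as.logical(rbinom(n, 1, 0.2)),
                     c = as.logical(rbinom(n, 1, 0.8)),
                     y = as.logical(rbinom(n, 1, 0.6)))
mydata$a[as.logical(rbinom(n, 1, 0.3))] <- NA
mydata$b[as.logical(rbinom(n, 1, 0.3))] <- NA
mydata$c[as.logical(rbinom(n, 1, 0.3))] <- NA

# Hypothetical split: 80 random rows for training, the rest for testing
train_idx <- sample(n, 80)
is_test   <- !(seq_len(n) %in% train_idx)

# Rows with ignore = TRUE are imputed, but do not influence
# the parameters of the imputation model
imp <- mice(mydata, method = "logreg", ignore = is_test,
            m = 5, maxit = 5, seed = 1, printFlag = FALSE)

# Take the first completed data set and split it back up
full_imp  <- complete(imp)
train_imp <- full_imp[!is_test, ]
test_imp  <- full_imp[is_test, ]
```

Note that the observed values in a test row, including y, are still used as predictors when that row's own missing values are imputed. If that is not wanted, the outcome can be excluded through the predictorMatrix argument.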

Upvotes: 4

camnesia

Reputation: 2323

prockenschaub has created a lovely function for that, called mice.reuse():

library(mice)
library(scorecard)

# function to impute new observations based on the previous imputation model
source("https://raw.githubusercontent.com/prockenschaub/Misc/master/R/mice.reuse/mice.reuse.R")

# split data into train and test
data_list <- split_df(airquality, y = NULL, ratio = 0.75, seed = 186)

imp <- mice(data = data_list$train, 
            seed = 500, 
            m = 5,
            method = "pmm",
            print = FALSE)


# impute test data based on train imputation model
test_imp <- mice.reuse(imp, data_list$test, maxit = 1)

Upvotes: 3

Joan Vila

Reputation: 11

When you are training a model, you cannot use the test data in any way. You therefore cannot run MICE on the complete dataset before splitting it. The imputation model must be fit on the training data only, and that same model must then be used to impute the test data.

Upvotes: 0

user8508347

Reputation:

Run mice imputation on the combined dataset and only then split it into train and test. Fit the machine learning classifier on the train set and then evaluate it on the test set.

Upvotes: -5
