hodgenovice
hodgenovice

Reputation: 624

Creating training dataset with .mids object in R

I have data with missing components, so I've run the mice algorithm (from package mice). The function returns a .mids object, which I want to split into a training and a test dataset to assess model fit. I want the training and test data to be of .mids form too so that they can be combined with various other functions such as pool to adjust standard errors according to Rubin's rules.

Here's my attempt where I simply remove rows from the data to get a training set:

library(mice)
data <- mice(nhanes,m=2,maxit=5,seed=1)

set.seed(2)
rand <- (1:nrow(nhanes))*rbinom(nrow(nhanes),size=1,prob=0.7)
train <- data
train$data <- train$data[rand,]

But then if I try to run a model with this data:

pool(with(train, lm(bmi ~ chl + age)))

I encounter an error stating that it's trying to replace 9 rows with 7 (presumably because I decreased the number of rows in train$data without adjusting other stuff as well).

Any help would be much appreciated.

Upvotes: 4

Views: 642

Answers (1)

user20650
user20650

Reputation: 25864

One way is to loop through the complete datasets and then assign the mira class to the list, which should allow pooling. (which is really what mice:::with.mids does)

Example without sampling

library(mice)

imp <- mice(nhanes,m=2, maxit=5, seed=1)

# With in-built pooling
pool(with(imp, lm(bmi ~ chl + age)))

# Pooled coefficients:
# (Intercept)         chl         age 
# 21.38496144  0.05975537 -3.40773396 
# 
# Fraction of information about the coefficients missing due to nonresponse: 
# (Intercept)         chl         age 
#   0.6186312   0.1060668   0.7380962 

# looping manually
mod <- list(analyses=vector("list", imp$m))

for(i in 1:imp$m){
  mod$analyses[[i]] <- lm(bmi ~ chl + age, data=complete(imp, i))
}

class(mod) <- c("mira", "matrix")
pool(mod)

# Pooled coefficients:
# (Intercept)         chl         age 
# 21.38496144  0.05975537 -3.40773396 
# 
# Fraction of information about the coefficients missing due to nonresponse: 
# (Intercept)         chl         age 
#   0.6186312   0.1060668   0.7380962 

Seems okay, so add in a sampling procedure

mod <- list(analyses=vector("list", imp$m))

set.seed(1)
for(i in 1:imp$m){
  rand <- (1:nrow(nhanes))*rbinom(nrow(nhanes),size=1,prob=0.7)
  mod$analyses[[i]] <- lm(bmi ~ chl + age, data=complete(imp, i)[rand,])
}

class(mod) <- c("mira", "matrix")
pool(mod)

# Pooled coefficients:
# (Intercept)         chl         age 
# 21.72382272  0.06468044 -4.23387415 
# 
# Fraction of information about the coefficients missing due to nonresponse: 
# (Intercept)         chl         age 
#   0.1496987   0.4497024   0.6101340 

Upvotes: 1

Related Questions