Reputation: 624
I have data with missing components, so I've run the mice algorithm (from package mice
). The function returns a .mids object, which I want to split into a training and a test dataset to assess model fit. I want the training and test data to be of .mids form too so that they can be combined with various other functions such as pool
to adjust standard errors according to Rubin's rules.
Here's my attempt where I simply remove rows from the data to get a training set:
library(mice)
data <- mice(nhanes,m=2,maxit=5,seed=1)
set.seed(2)
rand <- (1:nrow(nhanes))*rbinom(nrow(nhanes),size=1,prob=0.7)
train <- data
train$data <- train$data[rand,]
But then if I try to run a model with this data:
pool(with(train, lm(bmi ~ chl + age)))
I encounter an error stating that it's trying to replace 9 rows with 7 (presumably because I decreased the number of rows in train$data without adjusting other stuff as well).
Any help would be much appreciated.
Upvotes: 4
Views: 642
Reputation: 25864
One way is to loop through the complete
datasets and then assign the mira
class to the list, which should allow pool
ing. (which is really what mice:::with.mids
does)
Example without sampling
library(mice)
imp <- mice(nhanes,m=2, maxit=5, seed=1)
# With in-built pooling
pool(with(imp, lm(bmi ~ chl + age)))
# Pooled coefficients:
# (Intercept) chl age
# 21.38496144 0.05975537 -3.40773396
#
# Fraction of information about the coefficients missing due to nonresponse:
# (Intercept) chl age
# 0.6186312 0.1060668 0.7380962
# looping manually
mod <- list(analyses=vector("list", imp$m))
for(i in 1:imp$m){
mod$analyses[[i]] <- lm(bmi ~ chl + age, data=complete(imp, i))
}
class(mod) <- c("mira", "matrix")
pool(mod)
# Pooled coefficients:
# (Intercept) chl age
# 21.38496144 0.05975537 -3.40773396
#
# Fraction of information about the coefficients missing due to nonresponse:
# (Intercept) chl age
# 0.6186312 0.1060668 0.7380962
Seems okay, so add in a sampling procedure
mod <- list(analyses=vector("list", imp$m))
set.seed(1)
for(i in 1:imp$m){
rand <- (1:nrow(nhanes))*rbinom(nrow(nhanes),size=1,prob=0.7)
mod$analyses[[i]] <- lm(bmi ~ chl + age, data=complete(imp, i)[rand,])
}
class(mod) <- c("mira", "matrix")
pool(mod)
# Pooled coefficients:
# (Intercept) chl age
# 21.72382272 0.06468044 -4.23387415
#
# Fraction of information about the coefficients missing due to nonresponse:
# (Intercept) chl age
# 0.1496987 0.4497024 0.6101340
Upvotes: 1