Cherry Wu

Reputation: 4074

compare R packages missForest and Hmisc performance

I am trying to compare how well two R packages, missForest and Hmisc, handle missing values when more than 50% of the values are missing.

I generated the test data like this:

data("iris")
library(missForest)
iris.mis <- prodNA(iris, noNA = 0.6)
summary(iris.mis)

mis1 <- iris.mis
mis2 <- iris.mis
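Since the NA pattern is drawn at random, fixing the seed keeps the comparison reproducible across runs. Calling set.seed() before prodNA() is the simplest route. The sketch below shows the same idea in base R only; inject_na is a hypothetical helper written for illustration (similar in spirit to prodNA, not its actual implementation), and the seed value is arbitrary.

```r
# Hypothetical helper: blank out a fraction of all cells completely at random.
inject_na <- function(df, frac, seed = 42) {
  set.seed(seed)                         # fix the RNG so the NA pattern is reproducible
  n_cells <- nrow(df) * ncol(df)
  idx  <- sample(n_cells, round(frac * n_cells))
  rows <- (idx - 1) %% nrow(df) + 1      # column-major cell index -> row
  cols <- (idx - 1) %/% nrow(df) + 1     # column-major cell index -> column
  for (k in seq_along(idx)) df[rows[k], cols[k]] <- NA
  df
}

data("iris")
iris_na <- inject_na(iris, 0.6)
mean(is.na(iris_na))                     # overall share of missing cells
```

Both packages can then be given copies of the same object, as with mis1 and mis2 above.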

missForest provides the mixError() function, which lets you compare the imputation accuracy against the original data.

# using missForest
missForest_imputed <- missForest(mis1, ntree = 100)
missForest_error <- mixError(missForest_imputed$ximp, mis1, iris)
dim(missForest_imputed$ximp)
missForest_error

Hmisc does not have a mixError() function. I am using its aregImpute() to do the imputation, like this:

# using Hmisc
library(Hmisc)
hmisc_imputed <- aregImpute(~Sepal.Length + Sepal.Width + Petal.Length + Petal.Width + Species, 
                        data = mis2, n.impute = 1)

I was hoping to convert the imputed results into a format like missForest_imputed$ximp, so that I could use mixError() on them. The problem is that with aregImpute(), whether I set n.impute = 1 or n.impute = 5, I do not get 150 values for each feature as in the original iris data. The number of values also differs from feature to feature.

So, is there any way to compare the performance of missForest and Hmisc in dealing with missing values?

Upvotes: 0

Views: 1250

Answers (1)

alexwhitworth

Reputation: 4907

Part 1

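Hmisc::aregImpute returns the imputed values. For your object named hmisc_imputed, they can be found in hmisc_imputed$imputed. However, imputed is a list with one element per variable, and each element holds values only for the rows that were missing in that variable. That is why you do not see 150 values per feature, and why the counts differ between features.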

If you wish to recreate the equivalent of missForest_imputed$ximp, you have to do it yourself. To do so, we can use the fact that:

all.equal(as.integer(attr(hmisc_imputed$imputed$Sepal.Length, "dimnames")[[1]]), which(is.na(iris.mis$Sepal.Length))) ## returns TRUE

Which I do here:

check_missing <- function(x, hmisc) {
  # all.equal() returns a character vector on mismatch, so wrap it in isTRUE()
  return(isTRUE(all.equal(which(is.na(x)), as.integer(attr(hmisc, "dimnames")[[1]]))))
}

get_level_text <- function(val, lvls) {
  return(lvls[val])
}

convert <- function(miss_dat, hmisc) {
  m_p <- ncol(miss_dat)
  h_p <- length(hmisc)
  if (m_p != h_p) stop("miss_dat and hmisc must have the same number of variables")
  # assume matches for all if 1 matches
  if (!check_missing(miss_dat[[1]], hmisc[[1]]))
    stop("missing data and imputed data do not match")

  for (i in 1:m_p) {
    i_factor <- is.factor(miss_dat[[i]])
    if (!i_factor) {miss_dat[[i]][which(is.na(miss_dat[[i]]))] <- hmisc[[i]]}
    else {
      levels_i <- levels(miss_dat[[i]])
      miss_dat[[i]] <- as.character(miss_dat[[i]])
      miss_dat[[i]][which(is.na(miss_dat[[i]]))] <- sapply(hmisc[[i]], get_level_text, lvls= levels_i)
      miss_dat[[i]] <- factor(miss_dat[[i]], levels = levels_i)  # keep the original level set
    }
  }
  return(miss_dat)
}

iris.mis2 <- convert(iris.mis, hmisc_imputed$imputed)
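To see the row-name alignment that convert() relies on without running Hmisc, here is a minimal toy sketch. The matrix imp is a mock of a single element of hmisc_imputed$imputed, built by hand on the assumption that each element is a matrix with one column per imputation and row names equal to the indices of the missing rows. With n.impute greater than 1 you would pick one column (or combine them) before filling in.

```r
# Toy data with missing entries (not the iris example).
x <- c(5.1, NA, 4.7, NA, 5.0)

# Mock of one element of hmisc_imputed$imputed (assumed shape): a one-column
# matrix whose row names are the indices of the originally missing rows.
imp <- matrix(c(4.9, 4.6), ncol = 1, dimnames = list(c("2", "4"), NULL))

rows <- as.integer(rownames(imp))
stopifnot(identical(rows, which(is.na(x))))  # row names must match the NA positions

x[rows] <- imp[, 1]   # fill using the row names; column 1 = first imputation
x
```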

Part 2

mixError reports the normalized RMSE (NRMSE) for continuous variables and the proportion of falsely classified entries (PFC) for categorical ones. From ?mixError:

Value: imputation error. In case of continuous variables only this is the normalized root mean squared error (NRMSE, see 'help(missForest)' for further details). In case of categorical variables only this is the proportion of falsely classified entries (PFC). In case of mixed-type variables both error measures are supplied.

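To do this on your object from "Part 1" (iris.mis2), you just need to use the nrmse function, which is provided in library(missForest). Because iris.mis2 now has the same layout as missForest_imputed$ximp, passing it to mixError together with iris.mis and iris, as in your missForest call, should also work.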
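If you want to check the numbers by hand, both measures are easy to compute in base R. The helpers below are a sketch and not part of missForest: nrmse_manual and pfc_manual are hypothetical names, and the NRMSE formula assumes the definition described in help(missForest), i.e. the root of the mean squared error over the missing entries divided by the variance of the true values at those entries. The package may pool variables differently, so treat per-column results as an approximation of what mixError reports.

```r
# Hypothetical helpers; they only look at entries that were missing in x_mis.
nrmse_manual <- function(x_imp, x_mis, x_true) {
  mis <- is.na(x_mis)
  sqrt(mean((x_imp[mis] - x_true[mis])^2) / var(x_true[mis]))
}

pfc_manual <- function(x_imp, x_mis, x_true) {
  mis <- is.na(x_mis)
  mean(as.character(x_imp[mis]) != as.character(x_true[mis]))
}

# Toy vectors, made up for illustration.
truth_num <- c(1, 2, 3, 4, 5, 6)
mis_num   <- c(1, NA, 3, NA, 5, NA)
imp_num   <- c(1, 2.5, 3, 4, 5, 5.5)
err_num   <- nrmse_manual(imp_num, mis_num, truth_num)

truth_cat <- factor(c("a", "b", "a", "b"))
mis_cat   <- factor(c("a", NA, NA, "b"))
imp_cat   <- factor(c("a", "b", "b", "b"))
err_cat   <- pfc_manual(imp_cat, mis_cat, truth_cat)

c(NRMSE = err_num, PFC = err_cat)
```

On the iris example, the same helpers could be applied column by column, for instance nrmse_manual(iris.mis2$Sepal.Length, iris.mis$Sepal.Length, iris$Sepal.Length).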

Upvotes: 1
