Reputation: 4074
I am trying to compare how well two R packages, missForest and Hmisc, handle missing values when more than 50% of the values are missing.
I generated the test data like this:
data("iris")
library(missForest)
iris.mis <- prodNA(iris, noNA = 0.6)
summary(iris.mis)
mis1 <- iris.mis
mis2 <- iris.mis
missForest has a mixError() function, which lets you compare the imputation accuracy against the original data.
# using missForest
missForest_imputed <- missForest(mis1, ntree = 100)
missForest_error <- mixError(missForest_imputed$ximp, mis1, iris)
dim(missForest_imputed$ximp)
missForest_error
Hmisc does not have a mixError() function. I am using its powerful aregImpute() to do the imputation, like this:
# using Hmisc
library(Hmisc)
hmisc_imputed <- aregImpute(~Sepal.Length + Sepal.Width + Petal.Length + Petal.Width + Species,
data = mis2, n.impute = 1)
I was hoping to convert the imputed results into a format like missForest_imputed$ximp, so that I can use mixError(). The problem is that with aregImpute(), whether I try n.impute = 1 or n.impute = 5, I cannot get 150 values for each feature as in the original iris data. The number of values also differs from feature to feature.
So, is there any way to compare the performance of missForest and Hmisc in dealing with missing values?
Upvotes: 0
Views: 1250
Reputation: 4907
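Hmisc::aregImpute returns the imputed values. For your object named hmisc_imputed, they can be found in hmisc_imputed$imputed. However, imputed is a list with one element per variable, and each element holds only the values imputed for that variable's missing entries, which is why you do not see 150 values per feature.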
If you wish to recreate the equivalent of missForest_imputed$ximp, you have to do it yourself. To do so, we can use the fact that the row names of each imputed element are the indices of the missing observations:
all.equal(as.integer(attr(hmisc_imputed$imputed$Sepal.Length, "dimnames")[[1]]), which(is.na(iris.mis$Sepal.Length))) ## returns TRUE
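To make that alignment concrete, here is a small base-R sketch with a made-up matrix standing in for one element of hmisc_imputed$imputed. The values and the name fake_imputed are hypothetical; only the layout (one row per missing observation, row names holding the original row indices) is meant to mirror what aregImpute returns, as far as I can tell.

```r
# Toy vector with NAs at positions 2 and 5
x <- c(5.1, NA, 4.7, 4.6, NA, 5.4)

# Hypothetical stand-in for hmisc_imputed$imputed$Sepal.Length:
# one row per missing observation, one column per imputation,
# row names are the indices of the missing observations.
fake_imputed <- matrix(c(4.9, 5.0), ncol = 1,
                       dimnames = list(c("2", "5"), NULL))

na_rows  <- which(is.na(x))
imp_rows <- as.integer(rownames(fake_imputed))

# The positions of the NAs and the row names of the imputed matrix line up
aligned <- isTRUE(all.equal(na_rows, imp_rows))

# Fill the gaps with the first (here: only) imputation
x_filled <- x
x_filled[imp_rows] <- fake_imputed[, 1]
x_filled
```

The same idea, applied column by column, is all that is needed to rebuild a complete data frame.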
I use that fact in the functions below:
check_missing <- function(x, hmisc) {
  # all.equal() returns a character vector on mismatch, so wrap it in isTRUE()
  isTRUE(all.equal(which(is.na(x)), as.integer(attr(hmisc, "dimnames")[[1]])))
}
get_level_text <- function(val, lvls) {
  lvls[val]
}
convert <- function(miss_dat, hmisc) {
  m_p <- ncol(miss_dat)
  h_p <- length(hmisc)
  if (m_p != h_p) stop("miss_dat and hmisc must have the same number of variables")
  # assume matches for all if 1 matches
  if (!check_missing(miss_dat[[1]], hmisc[[1]]))
    stop("missing data and imputed data do not match")
  # note: this assumes a single imputation (n.impute = 1)
  for (i in seq_len(m_p)) {
    na_idx <- which(is.na(miss_dat[[i]]))
    if (!is.factor(miss_dat[[i]])) {
      miss_dat[[i]][na_idx] <- hmisc[[i]]
    } else {
      # imputed factor values are level codes; map them back to labels
      levels_i <- levels(miss_dat[[i]])
      miss_dat[[i]] <- as.character(miss_dat[[i]])
      miss_dat[[i]][na_idx] <- sapply(hmisc[[i]], get_level_text, lvls = levels_i)
      # keep the original level set, even if a level is never imputed
      miss_dat[[i]] <- factor(miss_dat[[i]], levels = levels_i)
    }
  }
  return(miss_dat)
}
iris.mis2 <- convert(iris.mis, hmisc_imputed$imputed)
mixError reports a normalized RMSE for continuous variables and a misclassification rate for categorical ones. From ?mixError:
Value: imputation error. In case of continuous variables only this is the normalized root mean squared error (NRMSE, see 'help(missForest)' for further details). In case of categorical variables only this is the proportion of falsely classified entries (PFC). In case of mixed-type variables both error measures are supplied.
To do this on your object from "Part 1" (iris.mis2), you just need to use the nrmse function, which is provided by library(missForest).
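For reference, here is a rough base-R sketch of what the two error measures compute, based on my reading of the missForest help pages: NRMSE as the square root of the mean squared error over the missing entries divided by the variance of the true values at those entries, and PFC as the share of imputed factor entries that differ from the truth. The helper names nrmse_sketch and pfc_sketch and the toy data are made up for illustration; they are not part of missForest.

```r
# Hypothetical helper: NRMSE over the entries that were missing in xmis
nrmse_sketch <- function(ximp, xmis, xtrue) {
  mis <- is.na(as.matrix(xmis))
  imp <- as.matrix(ximp)[mis]
  tru <- as.matrix(xtrue)[mis]
  sqrt(mean((imp - tru)^2) / var(tru))
}

# Hypothetical helper: proportion of falsely classified imputed entries
pfc_sketch <- function(ximp, xmis, xtrue) {
  mis <- is.na(xmis)
  mean(as.character(ximp)[mis] != as.character(xtrue)[mis])
}

# Toy continuous data: two entries removed, then "imputed"
xtrue <- data.frame(a = c(1, 2, 3, 4), b = c(2, 4, 6, 8))
xmis  <- xtrue
xmis$a[2] <- NA
xmis$b[3] <- NA
ximp  <- xmis
ximp$a[2] <- 3   # off by 1
ximp$b[3] <- 6   # exact
nrmse_val <- nrmse_sketch(ximp, xmis, xtrue)

# Toy factor: entries 1 and 4 removed, one imputed correctly, one not
ftrue <- factor(c("x", "y", "x", "y"))
fmis  <- ftrue
fmis[c(1, 4)] <- NA
fimp  <- factor(c("x", "y", "x", "x"), levels = levels(ftrue))
pfc_val <- pfc_sketch(fimp, fmis, ftrue)

c(NRMSE = nrmse_val, PFC = pfc_val)
```

If you would rather let the package do the work, mixError(iris.mis2, iris.mis, iris) should give both numbers in one call, which you can then put side by side with missForest_error.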
Upvotes: 1