Hamson

Reputation: 43

Multiple imputation in R using "missForest" on categorical variables

I have a survey dataset with NAs in several columns. Therefore, I decided to perform multiple imputation using the "missForest" package to impute the missing values. This worked without errors, but after checking my data I noticed that many of the imputed values are numeric with decimals in columns that were previously factors.

I assume that missForest requires the columns to be numeric (it requires a data.matrix for x) in order for it to perform imputation.

The NRMSE is quite good and the means of the columns with imputed values are similar to the columns with NAs.

I plan to use the dataset with the imputed values for a multilevel linear regression and would have converted the factor columns to numeric anyway.

Should these imputed values that are numeric with decimal places pose a problem?

finalmatrix <- data.matrix(final)
set.seed(666)
impforest <- missForest(finalmatrix, variablewise = TRUE, parallelize = "forests")

Upvotes: 0

Views: 3872

Answers (1)

Steffen Moritz

Reputation: 7730

I don't know your data or your code, but missForest is definitely able to deal with mixed-type data, and it does not convert factor columns to numeric on its own.
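One likely cause, going by the data.matrix() call in the question: data.matrix() replaces every factor with its internal integer codes, so whatever receives the result sees only numbers. The sketch below uses base R only; the small `final` data frame is a hypothetical stand-in, since the real survey data is not shown.

```r
# Hypothetical stand-in for the survey data: one factor, one numeric column
final <- data.frame(
  region = factor(c("north", "south", NA, "west")),
  income = c(41.5, NA, 38.2, 52.0)
)

str(final)                 # region is a factor with 3 levels

# Same conversion as in the question
finalmatrix <- data.matrix(final)

str(finalmatrix)           # a numeric matrix; region now holds integer codes
is.numeric(finalmatrix)    # TRUE: the factor levels are gone
finalmatrix[, "region"]    # codes follow the alphabetical level order
```

A numeric column like this would be imputed as a continuous variable, which would explain the decimal values.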

This is an example from the missForest manual:

## Nonparametric missing value imputation on mixed-type data:
## Take a look at iris: it has one variable (Species) that is a factor
library(missForest)
data(iris)
summary(iris)

## The data contains four continuous and one categorical variable.
## Artificially produce missing values using the 'prodNA' function:
set.seed(81)
iris.mis <- prodNA(iris, noNA = 0.2)
summary(iris.mis)

## Impute missing values providing the complete matrix for
## illustration. Use 'verbose' to see what happens between iterations:
iris.imp <- missForest(iris.mis, xtrue = iris, verbose = TRUE)


## Here are the final results
iris.imp

## As can be seen here, it still has the factor column
str(iris.imp$ximp)
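Applied to the call in the question, a minimal sketch would pass the data frame itself instead of the data.matrix() result. The real `final` is not shown, so iris with artificially removed values stands in for it here. `parallelize = "forests"` is left out to keep the sketch self-contained, because it needs a parallel backend to be registered first.

```r
library(missForest)

# Hypothetical stand-in for the survey data frame from the question
set.seed(666)
final <- prodNA(iris, noNA = 0.1)

# Pass the data frame directly; do not call data.matrix() first,
# so that factor columns are still factors when missForest sees them
impforest <- missForest(final, variablewise = TRUE)

# Species should still be a factor, imputed with valid levels only
str(impforest$ximp)

# With variablewise = TRUE there is one error estimate per column
impforest$OOBerror
```

If a factor really has to enter the regression as a number later, converting it after imputation (for example with as.integer()) yields whole-number codes instead of decimals.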

Upvotes: 0
