MickLyno

Reputation: 5

Partial imputation with missForest: combining the selected columns with the original dataset

Apologies for a rather simple question, but I have not managed to resolve this issue.

I want to impute only selected columns with missForest. The model then returns only those selected columns. What is the most elegant way to combine them with the original dataset, which includes all of the columns, given that there is no unique key per row to join on?

I tried following YohanK's instructions from this post: partial imputation with missForest

Example:

data(iris)

set.seed(81)

iris.mis <- prodNA(iris, noNA = 0.1)

imputedData <- missForest(iris.mis[c( 1, 2, 3, 4)], verbose = T)

dataset <- data.frame(iris[1], imputedData)

This results in the following error:

Error in as.data.frame.default(x[[i]], optional = TRUE, stringsAsFactors = stringsAsFactors) : cannot coerce class ‘"missForest"’ to a data.frame

Thank you for your help in advance!

Br, Mick

Upvotes: 0

Views: 794

Answers (1)

Steffen Moritz

Reputation: 7730

Ideally you give the whole dataset to missForest: even if you only want to impute certain columns, the other columns provide useful information for producing good imputation results.

With missForest you would implement the imputation of, e.g., only the first column like this:

  1. Keep a copy of the dataset with missing values and do not overwrite it.

  2. Perform the imputation with missForest on the whole dataset.

  3. Afterwards, replace the columns you wanted to impute in the initial dataset with the imputed ones you got from missForest.

For your example this works like this:

library(missForest)
data(iris)
set.seed(81) 

# Here you artificially introduce NAs to the iris data
iris.mis <- prodNA(iris, noNA = 0.1)

# Do the imputation with missForest on the whole data
imp_missforest <- missForest(iris.mis, verbose = TRUE)

# imp_missforest is a missForest object; use $ximp to get the imputed data frame
iris.imp <- imp_missforest$ximp

# Replace the desired columns in the initial iris.mis dataset with the columns from iris.imp
iris.mis$Sepal.Length <- iris.imp$Sepal.Length

# As you can see, only the first column in iris.mis is now imputed, and you can decide yourself what to do with the remaining columns
iris.mis

In my example only the first column, Sepal.Length, is imputed. The rest of iris.mis remains the same and keeps its missing values.
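If you want to fill in several columns at once, the same idea extends to a vector of column names. The following is a minimal sketch under the assumptions of the example above; the names cols and iris.partial are hypothetical and chosen only for illustration. Because missForest returns the rows in their original order, the columns can be copied back by position and no join key is needed.

```r
library(missForest)

data(iris)
set.seed(81)

# Artificially introduce NAs, as in the example above
iris.mis <- prodNA(iris, noNA = 0.1)

# Impute on the whole data so every column can act as a predictor;
# the imputed data frame is stored in the $ximp element
iris.imp <- missForest(iris.mis)$ximp

# Hypothetical selection: the columns that should end up imputed
cols <- c("Sepal.Length", "Petal.Width")

# Work on a copy so the data with missing values is not overwritten
iris.partial <- iris.mis
iris.partial[cols] <- iris.imp[cols]

# Count the remaining NAs per column: the selected columns have none,
# the other columns are unchanged
colSums(is.na(iris.partial))
```

The error in the question has the same cause: missForest returns a missForest object, not a data frame, so the imputed values have to be taken from its $ximp element before they are passed to data.frame().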

Upvotes: 1
