cbKCnSTL

Reputation: 21

missForest won't impute my categorical variables

I am not an experienced coder and have just started learning R the past few weeks to help with some work related to my PhD. Here is the issue:

I have been trying unsuccessfully for many, many hours to impute missing values into a data set using the missForest package in R. Below is a representative example of the problem I'm having with a fabricated data set.

The data set contains numeric values that are categorical. Upon importing, I use the following code to set the class to "factor":


    data <- read.csv("~Data.csv", colClasses = c(rep('factor',3)))

    > data
    a   b   c
    1   2   3
    4   5
    7   8   9

To verify that the class was set properly, I run:

    missForest::varClass(data)

which returns:

    [1] "factor" "factor" "factor"

I then attempt to impute and view the data, but I get the original data set back with the data point still missing instead of an imputed value.

    data.imp <- missForest(data)
    data.imp$ximp

    a   b   c
    1   2   3
    4   5
    7   8   9

The example above shows how I import the data, convert it to factor, and attempt to impute the missing data. The reproducible example below creates the same problem.

I am using R version 3.5.3 (2019-03-11)

    # install and load the missForest package
    install.packages("missForest")
    library(missForest)
    # create the test data frame with a missing value in column c
    a <- c("1","4","7")
    b <- c("2","5","8")
    c <- c("3","","9")
    data.test <- data.frame(a,b,c)
    # print the data
    data.test
    # view the class of the data to ensure it is "factor"
    missForest::varClass(data.test)
    # create the imputed data frame using missForest
    data.test.imp <- missForest(data.test)
    # print the imputed data frame
    data.test.imp$ximp

The above code returns the following, with the value in column c still missing:

    > data.test
      a b c
    1 1 2 3
    2 4 5  
    3 7 8 9
    > missForest::varClass(data.test)
    [1] "factor" "factor" "factor"
    > data.test.imp <- missForest(data.test)
      missForest iteration 1 in progress...done!
      missForest iteration 2 in progress...done!
    > data.test.imp$ximp
      a b c
    1 1 2 3
    2 4 5  
    3 7 8 9

If I convert all the data to numeric, it will impute values into the missing data points. Those imputed values are decimals even though all my data are integers, but it works nonetheless.

The real data set I'm using is much larger but I am having the exact same issue with it.

Further, if I follow the example in the missForest manual using the iris data set, everything works as it should. But if I download the same data set from the UCI repository, manually remove a categorical data point, and run the same code, it doesn't work.

I'm sure there is something minor that I am missing, but after hours of trying to figure this out I'm stuck.

Upvotes: 2

Views: 2196

Answers (1)

jay.sf

Reputation: 72603

This really seems to be a minor issue. Your data.test contains empty strings, which need to be coded as missing (NA) before missForest will impute them.

You can test that with str:

    str(data.test)
    # 'data.frame': 3 obs. of  3 variables:
    # $ a: Factor w/ 3 levels "1","4","7": 1 2 3
    # $ b: Factor w/ 3 levels "2","5","8": 1 2 3
    # $ c: Factor w/ 3 levels "","3","9": 2 1 3

You see, the levels of variable c contain "", which is therefore treated as a category of its own rather than as a missing value.
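A quick way to spot this in a larger data set is to count true NA values and empty strings per column. The following is only a sketch in base R; the data frame is rebuilt from the question's example, and `stringsAsFactors = TRUE` is set explicitly because R 4.0 and later no longer convert strings to factors by default. The names `na_counts` and `empty_counts` are just illustrative.

```r
# Rebuild the example from the question (explicit factor conversion)
data.test <- data.frame(a = c("1", "4", "7"),
                        b = c("2", "5", "8"),
                        c = c("3", "", "9"),
                        stringsAsFactors = TRUE)

# Number of true NA values per column
na_counts <- colSums(is.na(data.test))

# Number of empty-string entries per column (these are NOT NA)
empty_counts <- colSums(data.test == "", na.rm = TRUE)

print(na_counts)
print(empty_counts)
```

A column with a non-zero empty-string count and a zero NA count is a likely candidate for the problem described in the question.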

You can easily fix that by doing

    data.test[data.test == ""] <- NA
    data.test
    #   a b    c
    # 1 1 2    3
    # 2 4 5 <NA>
    # 3 7 8    9
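Note that replacing the values with NA does not remove "" from the factor's levels. That may not matter for missForest itself, but if you want a clean set of levels, `droplevels()` from base R can tidy it up. A minimal sketch, rebuilding the example with `stringsAsFactors = TRUE` so that it behaves the same on R 4.0 and later:

```r
# Rebuild the example from the question (explicit factor conversion)
data.test <- data.frame(a = c("1", "4", "7"),
                        b = c("2", "5", "8"),
                        c = c("3", "", "9"),
                        stringsAsFactors = TRUE)

# Recode empty strings as NA
data.test[data.test == ""] <- NA

# The unused level "" is still registered on the factor
levels_before <- levels(data.test$c)

# Drop unused levels from every factor column
data.test <- droplevels(data.test)
levels_after <- levels(data.test$c)

print(levels_before)
print(levels_after)
```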

Now, missForest works:

    data.test.imp <- missForest::missForest(data.test)
    data.test.imp$ximp
    #   a b c
    # 1 1 2 3
    # 2 4 5 9
    # 3 7 8 9
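For data read from a CSV file, the same recoding can probably be done at import time with the `na.strings` argument of `read.csv`. The sketch below uses an inline string (`csv_text`, a made-up stand-in for the real file) so that it runs on its own; with a real file you would pass the path instead of `text =`.

```r
# Hypothetical CSV contents standing in for the real file;
# the last field of the second data row is blank
csv_text <- "a,b,c\n1,2,3\n4,5,\n7,8,9"

# Read every column as a factor and treat blank fields as NA
data <- read.csv(text = csv_text,
                 colClasses = "factor",
                 na.strings = c("", "NA"))

str(data)
print(levels(data$c))
```

Read this way, the blank cell arrives as NA and "" never becomes a factor level, so no recoding step is needed afterwards.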

Upvotes: 3
