Henrik Nordmark

Reputation: 101

Applying mean imputation over a large subset of variables in R

I have a dataset with 498 variables of various kinds (numeric, logical, date and others), stored as a data frame in R with rows for observations and columns for variables. For a certain subset of these variables, I would like to replace the missing values with the mean of that variable.

I have coded this very simple function for mean imputation:

impute.mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))

This works beautifully when I apply it to an individual variable, say dataset$variableA:

dataset$variableA <- impute.mean(dataset$variableA)

That gives me exactly what I want for one variable, but I need to do this for a fairly large subset of variables and would rather not go through each one manually.

My first instinct was to use one of the apply functions in R to do this efficiently, but I don't seem to understand exactly how.

A rough first attempt was to use the standard apply:

newdataset <- apply(dataset, 2, impute.mean)

This is obviously a bit crude, since it tries to apply the function to all columns, including variables that are not numeric. Still, it seemed like a reasonable starting point even if it might generate a number of warnings. Alas, this method did not work and all my variables remain the same.

I have also experimented with lapply, mapply and ddply, but without any success so far.

Ideally, I would like to be able to do something like this:

relevantVariables <- c("variableA1", "variableA2", ..., "variableA293")
newdataset <- magical.apply(dataset, relevantVariables, impute.mean)

Is there some apply function that works in this manner?

Alternatively, is there some other efficient way of going about this?

Upvotes: 4

Views: 1587

Answers (2)

Max Ghenis

Reputation: 15793

You can do this efficiently with the data.table package:

library(data.table)

SetNAsToMean <- function(dt, vars) {
  # Sets NA values of columns to the column means
  #
  # Args:
  #   dt: data.table object to work with
  #   vars: vector of column names to replace NAs
  #
  # Returns:
  #   Nothing. Alters data.table in place.
  #
  # Example:
  #   dt <- data.table(num1 = c(1, NA, 3),
  #                    num2 = c(NA, NA, 4),
  #                    char1 = rep("a", 3))
  #   SetNAsToMean(dt, c("num1", "num2"))
  #   # Alternatively, set all columns whose class is "numeric"
  #   numerics <- names(dt)[sapply(dt, function(x) class(x)[1] == "numeric")]
  #   SetNAsToMean(dt, numerics)
  for (var in vars) {
    # set() assigns by reference, so dt is modified without being copied
    set(dt, which(is.na(dt[[var]])), var, mean(dt[[var]], na.rm = TRUE))
  }
}

Upvotes: 1

Vincent

Reputation: 955

Would this work for you?

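for (j in seq_len(ncol(dataset))) {
  # Only numeric columns have a meaningful mean
  if (is.numeric(dataset[[j]])) {
    # Compute the column mean once instead of once per missing value
    col_mean <- mean(dataset[[j]], na.rm = TRUE)
    for (k in seq_len(nrow(dataset))) {
      if (is.na(dataset[k, j])) {
        dataset[k, j] <- col_mean
      }
    }
  }
}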
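
If you only want a named subset of columns, the nested loops above can probably be collapsed into a single lapply over those columns. This is a minimal sketch, not a definitive solution: it reuses impute.mean from the question, while the toy data frame and the two column names in relevantVariables are made up for illustration and stand in for the real 498-column dataset.

```r
# impute.mean is the helper defined in the question
impute.mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))

# Hypothetical toy data standing in for the real data frame
dataset <- data.frame(
  variableA1 = c(1, NA, 3),
  variableA2 = c(NA, NA, 4),
  label      = c("a", NA, "c"),
  stringsAsFactors = FALSE
)

# Hypothetical subset of columns that should be imputed
relevantVariables <- c("variableA1", "variableA2")

# lapply works column by column and keeps each column's type;
# assigning back into the same columns leaves all other columns untouched
newdataset <- dataset
newdataset[relevantVariables] <- lapply(newdataset[relevantVariables], impute.mean)
```

Unlike apply(dataset, 2, ...), which first converts the whole data frame to a matrix (a character matrix when the columns have mixed types), this keeps the result a data frame and never touches the non-numeric columns.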

Upvotes: 0
