John Snow

Reputation: 153

Mean imputation issue with data.table

I am trying to impute missing values in all numeric columns using this loop:

for(i in 1:ncol(df)){
  if (is.numeric(df[,i])){
    df[is.na(df[,i]), i] <- mean(df[,i], na.rm = TRUE)
  }
}

When the data.table package is not attached, the code above works as it should. Once I attach data.table, the behaviour changes and I get this error:

Error in `[.data.table`(df, , i) : 
  j (the 2nd argument inside [...]) is a single symbol but column name 'i' 
is not found. Perhaps you intended DT[,..i] or DT[,i,with=FALSE]. This 
difference to data.frame is deliberate and explained in FAQ 1.1.

I tried '..i' and 'with=FALSE' everywhere, but with no success. In fact, the code does not even get past the first is.numeric condition.

Upvotes: 1

Views: 1049

Answers (2)

Oliver

Reputation: 8572

Here is an alternative answer, which I came up with while working on a similar problem at a larger scale. One might prefer to avoid for loops by using the [.data.table method:

DF[i, j, by, on, ...]

First we'll create a function that can perform the imputation

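impute_na <- function(x, val = mean, ...){
  # Non-numeric columns are returned unchanged
  if(!is.numeric(x)) return(x)
  na <- is.na(x)
  # If 'val' is a function, apply it to the non-missing values;
  # extra arguments in '...' are forwarded to it
  if(is.function(val))
    val <- val(x[!na], ...)
  if(!is.numeric(val) || length(val) != 1L)
    stop("'val' needs to be either a function or a single numeric value!")
  x[na] <- val
  x
}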

To perform the imputation on the data frame, one could create and evaluate an expression in the data.table environment, but to keep the example simple we'll just overwrite the table using <-

DF <- DF[, lapply(.SD, impute_na)]

This imputes the mean in every numeric column and keeps any non-numeric columns as they are. If we wish to impute another value (say, 42), or we have a grouping variable and want the mean computed within each group, that can be done as well:

DF <- DF[, lapply(.SD, impute_na, val = 42)]
DF <- DF[, lapply(.SD, impute_na), by = group]

These impute 42 and the within-group mean, respectively.
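As a rough illustration of the calls above, here is a small self-contained sketch. The table DT and its columns (group, x, label) are made up for this example, and impute_na is repeated so the snippet runs on its own. The last step, using := with .SDcols, is one possible way to update the numeric columns by reference instead of overwriting the table; treat it as a suggestion, not as part of the original approach.

```r
library(data.table)

# Same helper as above, repeated so this snippet is self-contained
impute_na <- function(x, val = mean, ...) {
  if (!is.numeric(x)) return(x)
  na <- is.na(x)
  if (is.function(val)) val <- val(x[!na], ...)
  if (!is.numeric(val) || length(val) != 1L)
    stop("'val' needs to be either a function or a single numeric value!")
  x[na] <- val
  x
}

# Hypothetical toy data: one grouping column, one numeric, one character
DT <- data.table(
  group = c("a", "a", "a", "b", "b", "b"),
  x     = c(1, NA, 3, 10, NA, 30),
  label = c("u", "v", NA, "w", "x", "y")
)

overall  <- DT[, lapply(.SD, impute_na)]              # x: 1 11 3 10 11 30
fixed    <- DT[, lapply(.SD, impute_na, val = 42)]    # x: 1 42 3 10 42 30
by_group <- DT[, lapply(.SD, impute_na), by = group]  # x: 1 2 3 10 20 30

# Assumed alternative: update only the numeric columns by reference
cols <- names(DT)[sapply(DT, is.numeric)]
DT2 <- copy(DT)
DT2[, (cols) := lapply(.SD, impute_na), .SDcols = cols]
```

Note that the character column label keeps its NA, since impute_na returns non-numeric input untouched. The toy column x is stored as double on purpose: with by = group, an integer column that gets a non-integer mean in only some groups can lead to a type mismatch between groups.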

Upvotes: 0

talat

Reputation: 70286

The data.table syntax is a little different in such a case. You can do it as follows:

num_cols <- names(df)[sapply(df, is.numeric)]
for(col in num_cols) {
  set(df, i = which(is.na(df[[col]])), j = col, value = mean(df[[col]], na.rm=TRUE))
}

Or, if you want to keep using your existing loop, you can just convert the data back to a plain data.frame using

setDF(df)
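To make both options concrete, here is a minimal sketch on a made-up table (the columns id, score and count are hypothetical). The first part runs the set() loop on a data.table; the second converts a copy with setDF() and then runs the loop from the question unchanged. Columns are read with df[[col]], which returns the column vector for a data.table as well as for a data.frame.

```r
library(data.table)

# Hypothetical toy data with one NA in each numeric column
df <- data.table(
  id    = c("a", "b", "c", "d"),
  score = c(1, NA, 3, 8),
  count = c(2, 4, NA, 6)
)
df2 <- copy(df)  # second copy for the setDF() variant

# Option 1: set() fills the NA rows of each numeric column by reference
num_cols <- names(df)[sapply(df, is.numeric)]
for (col in num_cols) {
  set(df, i = which(is.na(df[[col]])), j = col,
      value = mean(df[[col]], na.rm = TRUE))
}
# df$score is now 1 4 3 8 and df$count is 2 4 4 6

# Option 2: convert to a plain data.frame, then the original loop works
setDF(df2)
for (i in 1:ncol(df2)) {
  if (is.numeric(df2[, i])) {
    df2[is.na(df2[, i]), i] <- mean(df2[, i], na.rm = TRUE)
  }
}
```

Both variants give the same imputed values here. The difference is that after Option 2 the object is no longer a data.table, so data.table syntax will not apply to it until it is converted back, for example with setDT().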

Upvotes: 2
