user1412

Reputation: 729

R data.table: impute missing values for multiple sets of columns

I want to impute missing values for several sets of columns. For numeric variables I want to replace NA with the median, and for categorical variables I want to replace NA with the mode. I searched for a way to impute differently for different sets of columns but did not find one.

My data is large, with many columns, so I keep it in a data.table. Since I am not sure how to do this in data.table, I tried the base R code below, but I seem to be getting the column name identification wrong.

I store the names of the numeric variables in the vector var_num and the names of the categorical variables in the vector var_chr.

Please see my sample code below:

library(data.table)
set.seed(1200)
id <- 1:100
bills <- sample(c(1:20,NA),100,replace = T)
nos <- sample(c(1:80,NA),100,replace = T)
stru <- sample(c("A","B","C","D",NA),100,replace = T)
type <- sample(c(1:7,NA),100,replace = T)
value <- sample(c(100:1000,NA),100,replace = T)

df1 <- as.data.table(data.frame(id,bills,nos,stru,type,value))
class(df1)

var_num <- c("bills","nos","value")
var_chr <- c("stru","type")

impute <- function(x){
  #print(x)
  if(colnames(x) %in% var_num){
    x[is.na(x)] = median(x,na.rm = T)
  } else if (colnames(x) %in% var_chr){
    x[is.na(x)] = mode(x)
  } else {
    x #if not part of var_num and var_chr then nothing needs to be done and return the original value
  }
  return(x)
}


df1_imp_med <- data.frame(apply(df1,2,impute))

When I run the above, I get this error: Error in if (colnames(x) %in% var_num) { : argument is of length zero

Please help me understand how I can correct this and achieve my requirement.

Upvotes: 1

Views: 2808

Answers (4)

IceCreamToucan

Reputation: 28705

Another option, using lapply:

lapply(c(var_num, var_chr), function(col) {
  # pick the imputation function for this column
  imp.fun <- if (col %in% var_num) {
    function(v) median(v, na.rm = TRUE)
  } else {
    function(v) names(which.max(table(v)))
  }
  df1[is.na(df1[[col]]), (col) := imp.fun(df1[[col]])]
})

Upvotes: 0

YOLO

Reputation: 21749

As suggested in the comments, you can combine a for loop with set() in data.table for faster imputation:

for (k in names(df1)) {
  if (k %in% var_num) {
    # impute numeric variables with the median
    med <- median(df1[[k]], na.rm = TRUE)
    set(x = df1, i = which(is.na(df1[[k]])), j = k, value = med)
  } else if (k %in% var_chr) {
    # impute categorical variables with the mode (most frequent value)
    mode_val <- names(which.max(table(df1[[k]])))
    set(x = df1, i = which(is.na(df1[[k]])), j = k, value = mode_val)
  }
}
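
If the column types are a concern (a median can be fractional, and table() reports the most frequent value as a string), the same for/set idea can be wrapped in a small function that converts each column first. This is only a sketch under assumptions: impute_by_ref and most_frequent are hypothetical helper names, not part of data.table, and it assumes you are happy to store numeric columns as double and categorical columns as character.

```r
library(data.table)

# Hypothetical helper: most frequent non-NA value, returned as a string.
most_frequent <- function(v) names(which.max(table(v)))

# Hypothetical helper: impute the listed columns of dt by reference.
impute_by_ref <- function(dt, num_cols, chr_cols) {
  for (k in num_cols) {
    # Store as double so a fractional median is not truncated.
    set(dt, j = k, value = as.numeric(dt[[k]]))
    idx <- which(is.na(dt[[k]]))
    if (length(idx)) set(dt, i = idx, j = k, value = median(dt[[k]], na.rm = TRUE))
  }
  for (k in chr_cols) {
    # Store as character so the string returned by most_frequent() fits.
    set(dt, j = k, value = as.character(dt[[k]]))
    idx <- which(is.na(dt[[k]]))
    if (length(idx)) set(dt, i = idx, j = k, value = most_frequent(dt[[k]]))
  }
  invisible(dt)
}

# Small made-up table to try it on.
dt <- data.table(id    = 1:6,
                 bills = c(1L, 2L, NA, 4L, 10L, NA),
                 type  = c(3L, 3L, NA, 5L, 3L, 7L))
impute_by_ref(dt, num_cols = "bills", chr_cols = "type")
print(dt)
```

Columns not listed in num_cols or chr_cols (here id) are left untouched, which matches the requirement in the question.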

Upvotes: 7

user1412

Reputation: 729

I managed to get a working solution. The key point was to use the variables listed in var_num and var_chr for numeric and categorical imputation respectively. Variables not listed in these vectors do not need to be imputed.

The challenge I faced was referring to them inside the function. I dropped the idea of writing a function and wrote a for loop instead:

df1 <- as.data.frame(df1)

for (var in 1:ncol(df1)) {
  if (names(df1[var]) %in% var_num) {
    df1[is.na(df1[,var]),var] <- median(df1[,var], na.rm = TRUE)
  } else if (names(df1[var]) %in% var_chr) {
    df1[is.na(df1[,var]),var] <- names(which.max(table(df1[,var])))
  }
}

This for loop does the required imputation.

If there is a simpler, more concise way of achieving this, do let me know. Perhaps a function from the apply family would do the trick.
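
On the apply-family idea, one possible shape is lapply over .SD inside a data.table, once per group of columns. This is a minimal sketch on a small made-up table, not a tested drop-in for the real data; most_frequent is a hypothetical helper name, and var_num / var_chr are redefined here only to keep the example self-contained.

```r
library(data.table)

# Hypothetical helper: most frequent non-NA value, returned as a string.
most_frequent <- function(v) names(which.max(table(v)))

dt <- data.table(id   = 1:5,
                 nos  = c(10, NA, 30, 40, NA),
                 stru = c("A", NA, "B", "A", "C"))
var_num <- "nos"
var_chr <- "stru"

# Replace NA with the column median in every numeric column.
dt[, (var_num) := lapply(.SD, function(v) replace(v, is.na(v), median(v, na.rm = TRUE))),
   .SDcols = var_num]

# Replace NA with the most frequent value in every categorical column.
dt[, (var_chr) := lapply(.SD, function(v) replace(v, is.na(v), most_frequent(v))),
   .SDcols = var_chr]

print(dt)
```

Because each call rewrites whole columns, any column not named in var_num or var_chr is left as it was.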

Upvotes: 0

caw5cv

Reputation: 721

It may or may not be worth your time to write a single function covering both of your use cases. A direct (but specific) solution is below. Note that mode() may not behave as you expect: per ?mode, it returns the storage mode of an object (for example "character"), not its most frequent value.

library(data.table)

set.seed(1200)
df1 <- data.table(
  id = 1:100,
  bills = sample(c(1:20, NA), 100, replace = TRUE),
  nos = sample(c(1:80, NA), 100, replace = TRUE),
  stru = sample(c("A", "B", "C", "D", NA), 100, replace = TRUE),
  type = sample(c(as.character(1:7), NA), 100, replace = TRUE),
  value = sample(c(100:1000, NA), 100, replace = TRUE)
)

# Function to calculate the most frequent object in a vector:
getMode <- function(myvector) {
    mytable <- table(myvector)
    return(names(mytable)[which.max(mytable)])
}

# replace na values by reference, with `:=`
df1[is.na(bills), bills := median(df1[,bills], na.rm=T)]
df1[is.na(nos), nos := median(df1[,nos], na.rm=T)]
df1[is.na(value), value := median(df1[,value], na.rm=T)]
df1[is.na(stru), stru := getMode(df1[,stru])]
df1[is.na(type), type := getMode(df1[,type])]
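
To make the point about mode() concrete, here is a small base R illustration on a made-up vector. getMode is repeated from above only so the snippet runs on its own.

```r
x <- c("A", "B", "B", NA, "C")

# mode() reports how the object is stored and ignores the values in it.
mode(x)              # "character"
mode(c(1L, 2L, 2L))  # "numeric"

# Most frequent value; table() drops NA by default.
getMode <- function(myvector) {
  mytable <- table(myvector)
  names(mytable)[which.max(mytable)]
}
getMode(x)           # "B"
```

Note that which.max() returns the first maximum, so when two values are tied the one that comes first in the table's sorted names is chosen.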

Upvotes: 3
