Reputation: 729
I want to impute missing values for a few sets of columns. For numeric variables I want to replace NA with the median, and for categorical variables I want to replace NA with the mode. I searched for a way to impute different sets of columns separately but did not find one.
My data is large, with many columns, so I keep it in a data.table. Since I am not sure how to do this in data.table, I tried the base R code below, but I seem to be getting the column-name lookup wrong.
The numeric variables are listed in the vector var_num and the categorical variables in the vector var_chr.
Please see my sample code below:
library(data.table)
set.seed(1200)
id <- 1:100
bills <- sample(c(1:20,NA),100,replace = T)
nos <- sample(c(1:80,NA),100,replace = T)
stru <- sample(c("A","B","C","D",NA),100,replace = T)
type <- sample(c(1:7,NA),100,replace = T)
value <- sample(c(100:1000,NA),100,replace = T)
df1 <- as.data.table(data.frame(id,bills,nos,stru,type,value))
class(df1)
var_num <- c("bills","nos","value")
var_chr <- c("stru","type")
impute <- function(x){
#print(x)
if(colnames(x) %in% var_num){
x[is.na(x)] = median(x,na.rm = T)
} else if (colnames(x) %in% var_chr){
x[is.na(x)] = mode(x)
} else {
x #if not part of var_num and var_chr then nothing needs to be done and return the original value
}
return(x)
}
df1_imp_med <- data.frame(apply(df1,2,impute))
When I run the above, I get this error: Error in if (colnames(x) %in% var_num) { : argument is of length zero
Please help me understand how I can correct this and achieve my requirement.
Upvotes: 1
Views: 2808
Reputation: 28705
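Another option, using lapply:
lapply(c(var_num, var_chr), function(col){
  # pick the imputation function according to the column group
  imp.fun <- if (col %in% var_num) {
    function(v) median(v, na.rm = TRUE)
  } else {
    function(v) names(which.max(table(v)))
  }
  # fill only the NA rows of this column, by reference
  df1[is.na(df1[[col]]), (col) := imp.fun(df1[[col]])]})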
Upvotes: 0
Reputation: 21749
As suggested in the comments, you can combine a for loop with set() in data.table for faster imputation:
for(k in names(df1)){
if(k %in% var_num){
# impute numeric variables with median
med <- median(df1[[k]],na.rm = T)
set(x = df1, which(is.na(df1[[k]])), k, med)
} else if(k %in% var_chr){
## impute categorical variables with mode
mode <- names(which.max(table(df1[[k]])))
set(x = df1, which(is.na(df1[[k]])), k, mode)
}
}
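The same idea can also be written without an explicit loop by combining `:=` with `.SDcols`. The following is only a sketch on toy data: the table `dt` and the helpers `fill_median` and `fill_mode` are made up for illustration and are not part of the original answer. Because it replaces whole columns, an integer column may become double when the median is fractional.

```r
library(data.table)

# Toy data standing in for df1 (hypothetical values)
dt <- data.table(
  id    = 1:6,
  bills = c(1, NA, 3, 10, NA, 5),
  stru  = c("A", NA, "B", "A", "A", NA)
)
var_num <- "bills"
var_chr <- "stru"

# Hypothetical helpers: fill NA with the median / the most frequent value
fill_median <- function(v) replace(v, is.na(v), median(v, na.rm = TRUE))
fill_mode   <- function(v) replace(v, is.na(v), names(which.max(table(v))))

# Apply each helper only to its own group of columns, by reference
dt[, (var_num) := lapply(.SD, fill_median), .SDcols = var_num]
dt[, (var_chr) := lapply(.SD, fill_mode),   .SDcols = var_chr]
```

Columns that appear in neither vector (here `id`) are not touched.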
Upvotes: 7
Reputation: 729
I managed to get a working solution. The key was to impute only the variables listed in var_num (numeric) and var_chr (categorical); variables not listed in either vector need no imputation.
The challenge I faced was referring to those vectors inside the function. I dropped the function and wrote a for loop instead:
df1 <- as.data.frame(df1)
for (var in 1:ncol(df1)) {
if (names(df1[var]) %in% var_num) {
df1[is.na(df1[,var]),var] <- median(df1[,var], na.rm = TRUE)
} else if (names(df1[var]) %in% var_chr) {
df1[is.na(df1[,var]),var] <- names(which.max(table(df1[,var])))
}
}
This for loop does the required imputation.
If there is a simpler, more concise way to achieve this, do let me know. Perhaps a function from the apply family would do the trick.
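One possible apply-family version, sketched on toy data, is to call lapply on only the listed columns and assign the result back. It also sidesteps the problem in the original apply(df1, 2, impute) attempt: apply passes each column to the function as a bare vector, so colnames(x) is NULL and the if condition has length zero. The data frame `toy` and the helper names below are illustrative assumptions, not part of the original code.

```r
# Toy data frame standing in for df1 (hypothetical values)
toy <- data.frame(
  id   = 1:5,
  nos  = c(2, NA, 8, 4, NA),
  type = c("x", "y", NA, "y", "y"),
  stringsAsFactors = FALSE
)
var_num <- "nos"
var_chr <- "type"

# Hypothetical helpers: replace NA with the median / the most frequent value
impute_median <- function(v) { v[is.na(v)] <- median(v, na.rm = TRUE); v }
impute_mode   <- function(v) { v[is.na(v)] <- names(which.max(table(v))); v }

# Columns outside var_num and var_chr (here: id) are left untouched
toy[var_num] <- lapply(toy[var_num], impute_median)
toy[var_chr] <- lapply(toy[var_chr], impute_mode)
```

Selecting columns by name with `toy[var_num]` keeps the decision about which columns to impute outside the helper functions, so the helpers never need to know a column's name.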
Upvotes: 0
Reputation: 721
It may or may not be worth your time to code up a single function for both of your use cases. A direct (but specific) solution is below. Note that mode may not behave as you expect; see ?mode: in base R it returns the type of an object (such as "numeric" or "character"), not its most frequent value.
library(data.table)
set.seed(1200)
df1 <- data.table(
id = 1:100,
bills = sample(c(1:20,NA),100,replace = T),
nos = sample(c(1:80,NA),100,replace = T),
stru = sample(c("A","B","C","D",NA),100,replace = T),
type = sample(c(as.character(1:7),NA),100,replace = T),
value = sample(c(100:1000,NA),100,replace = T)
)
# Function to calculate the most frequent object in a vector:
getMode <- function(myvector) {
mytable <- table(myvector)
return(names(mytable)[which.max(mytable)])
}
# replace na values by reference, with `:=`
df1[is.na(bills), bills := median(df1[,bills], na.rm=T)]
df1[is.na(nos), nos := median(df1[,nos], na.rm=T)]
df1[is.na(value), value := median(df1[,value], na.rm=T)]
df1[is.na(stru), stru := getMode(df1[,stru])]
df1[is.na(type), type := getMode(df1[,type])]
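To illustrate the note about mode above, here is a small sketch comparing base R's mode() with a frequency-based helper like getMode. The vector `x` is made up for the example. Note that the helper returns a character string, because it reads the value from the names of the frequency table.

```r
x <- c(3, 7, 7, NA, 2, 7)   # made-up example vector

# Base R mode() reports the type of the object, not a statistic
mode(x)

# Most frequent value, taken from the names of the frequency table
getMode <- function(myvector) {
  mytable <- table(myvector)  # table() drops NA by default
  names(mytable)[which.max(mytable)]
}

getMode(x)                  # most frequent value, as a character string
as.numeric(getMode(x))      # convert back when the column is numeric
```

This is why the example data above stores type as character: the imputed value then has the same type as the column.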
Upvotes: 3