Reputation: 411
I have a data frame (cat_df) that contains only categorical variables. I want to replace the missing values in each variable with that variable's mode.
I tried the following code, but it's not working.
Approach 1:
cat_df[is.na(cat_df)] <- modefunc(cat_df, na.rm = TRUE)
cat_df
modefunc <- function(x){
  tabresult <- tabulate(x)
  themode <- which(tabresult == max(tabresult))
  if(sum(tabresult == max(tabresult))>1) themode <- NA
  return(themode)
}
Error in modefunc(cat_df, na.rm = TRUE) : unused argument (na.rm = TRUE)
Approach 2:
cat_df[is.na(cat_df)] <- my_mode(cat_df[!is.na(cat_df)])
cat_df
my_mode <- function(x){
  unique_x <- unique(x)
  mode <- unique_x[which.max(tabulate(match(x,unique_x)))]
  mode
}
The above code does not impute the mode values.
Is there another way to impute mode values for categorical variables in a data frame?
Upvotes: 0
Views: 1590
Reputation: 79184
Update:
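# Mode of a vector. table() already drops NA, so na.rm is accepted only to mirror mean().
my_mode <- function(x, na.rm = TRUE) {
  xtab <- table(x)
  xmode <- names(which(xtab == max(xtab)))
  # flag ties rather than picking one of the modes
  if (length(xmode) > 1) xmode <- ">1 mode"
  return(xmode)
}
# Impute column by column: mean for numeric columns, mode for character/factor columns
for (var in seq_len(ncol(cat_df))) {
  if (is.numeric(cat_df[, var])) {
    cat_df[is.na(cat_df[, var]), var] <- mean(cat_df[, var], na.rm = TRUE)
  } else if (is.character(cat_df[, var]) || is.factor(cat_df[, var])) {
    cat_df[is.na(cat_df[, var]), var] <- my_mode(cat_df[, var], na.rm = TRUE)
  }
}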
This mode function works on a single vector (one column at a time). Try this and please let me know.
# distinct non-missing values in the vector
values <- unique(cat_column[!is.na(cat_column)])
# mode of cat_column
themode <- values[which.max(tabulate(match(cat_column, values)))]
# replace the missing values with the mode
imputevector <- cat_column
imputevector[is.na(imputevector)] <- themode
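The same steps can be wrapped in a function and applied to every column of the data frame. This is only a minimal sketch, assuming all columns are character or factor. The helper name `impute_mode` and the toy `cat_df` are made up for illustration, and ties go to the value that appears first in the column.

```r
# Hypothetical helper: replace the NAs in one vector with its most frequent value.
impute_mode <- function(x) {
  values <- unique(x[!is.na(x)])          # distinct non-missing values
  if (length(values) == 0) return(x)      # all NA: nothing to impute with
  themode <- values[which.max(tabulate(match(x, values)))]
  x[is.na(x)] <- themode
  x
}

# Toy data, made up for this example
cat_df <- data.frame(
  colour = c("red", "blue", NA, "red"),
  size   = c("S", NA, "M", "M"),
  stringsAsFactors = FALSE
)

# Assigning into cat_df[] replaces every column but keeps the data frame structure
cat_df[] <- lapply(cat_df, impute_mode)
cat_df
```

Using `cat_df[] <- lapply(...)` rather than `cat_df <- lapply(...)` is what keeps the result a data frame instead of a plain list.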
Upvotes: 1
Reputation: 8886
Here is the mode function I use with an additional line to choose a single mode in the event there are actually multiple modes:
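my_mode <- function(x) {
  ux <- unique(x[!is.na(x)])     # distinct non-missing values, so NA can never be the mode
  tab <- tabulate(match(x, ux))  # count of each distinct value
  mode <- ux[tab == max(tab)]    # every value tied for the highest count
  # break ties at random; if/else (unlike ifelse) keeps factor attributes
  if (length(mode) > 1) sample(mode, 1) else mode
}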
# single mode
cat_col_1 <- c(1, 1, 2, NA)
cat_col_1
#> [1] 1 1 2 NA
cat_col_1[is.na(cat_col_1)] <- my_mode(cat_col_1)
cat_col_1
#> [1] 1 1 2 1
# random sample among multimodal
cat_col_2 <- c(1, 1, 2, 2, NA)
cat_col_2
#> [1] 1 1 2 2 NA
cat_col_2[is.na(cat_col_2)] <- my_mode(cat_col_2)
cat_col_2
#> [1] 1 1 2 2 2
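Because ties are broken with `sample()`, the imputed value can differ between runs. Calling `set.seed()` beforehand is one way to make it repeatable. Another is a deterministic tie-break, sketched here under the assumption that "first value seen wins" is acceptable; `first_mode` is a hypothetical name, not part of the function above.

```r
# Hypothetical deterministic variant: ties go to the value that appears first.
first_mode <- function(x) {
  ux <- unique(x[!is.na(x)])               # distinct non-missing values, in order of appearance
  ux[which.max(tabulate(match(x, ux)))]    # which.max() returns the position of the first maximum
}

cat_col <- c(2, 2, 1, 1, NA)
cat_col[is.na(cat_col)] <- first_mode(cat_col)
cat_col
#> [1] 2 2 1 1 2
```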
Other people have written mode functions as well. One possibility is Mode() from the DescTools package.
Because it returns every mode when there is more than one, you need to decide how to handle ties.
Here is an example that randomly samples, with replacement, as many modes as there are missing values to replace.
# single mode
cat_col_3 <- c(1, 1, 2, NA)
cat_col_3
#> [1] 1 1 2 NA
cat_col_3_modes <- DescTools::Mode(cat_col_3, na.rm = TRUE)
cat_col_3_nmiss <- sum(is.na(cat_col_3))
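# sample positions, not values: sample(x, ...) on a single number x would draw from 1:x
cat_col_3[is.na(cat_col_3)] <- cat_col_3_modes[sample.int(length(cat_col_3_modes), cat_col_3_nmiss, replace = TRUE)]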
cat_col_3
#> [1] 1 1 2 1
# random sample among multimodal
cat_col_4 <- c(1, 1, 2, 2, NA, NA)
cat_col_4
#> [1] 1 1 2 2 NA NA
cat_col_4_modes <- DescTools::Mode(cat_col_4, na.rm = TRUE)
cat_col_4_nmiss <- sum(is.na(cat_col_4))
cat_col_4[is.na(cat_col_4)] <- cat_col_4_modes[sample.int(length(cat_col_4_modes), cat_col_4_nmiss, replace = TRUE)]
cat_col_4
#> [1] 1 1 2 2 2 1
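A note on `sample()`: when its first argument is a single number n (with n >= 1), it samples from 1:n rather than returning n, so passing a lone mode such as 5 directly could yield any of 1 to 5. Sampling positions avoids this. The following is a small sketch; `pick_modes` is a hypothetical helper and not part of DescTools, and the mode is hard-coded here to keep the example free of dependencies.

```r
# Hypothetical helper: draw n values from a vector of modes by sampling positions,
# so a single numeric mode is returned as-is instead of being expanded to 1:mode.
pick_modes <- function(modes, n) {
  modes[sample.int(length(modes), n, replace = TRUE)]
}

x <- c(5, 5, 2, NA, NA)
modes <- 5   # the single mode of x, as a mode function would report it
x[is.na(x)] <- pick_modes(modes, sum(is.na(x)))
x
#> [1] 5 5 2 5 5
```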
Created on 2021-04-16 by the reprex package (v1.0.0)
Upvotes: 0