Anne

Reputation: 411

Mode imputation for categorical variables in a dataframe

I have a data frame (cat_df) that contains only categorical variables. I want to replace the missing values in each variable with that variable's mode.

I tried the following code, but it's not working.

Way 1

cat_df[is.na(cat_df)] <- modefunc(cat_df, na.rm = TRUE)
cat_df

modefunc <- function(x){
  tabresult <- tabulate(x)
  themode <- which(tabresult == max(tabresult))
  if(sum(tabresult == max(tabresult))>1) themode <- NA
  return(themode)
}

Error in modefunc(cat_df, na.rm = TRUE) : unused argument (na.rm = TRUE)

Way 2

cat_df[is.na(cat_df)] <- my_mode(cat_df[!is.na(cat_df)])
cat_df

my_mode <- function(x){
  unique_x <- unique(x)
  mode <- unique_x[which.max(tabulate(match(x,unique_x)))]
  mode
}

The above code does not impute the mode values.

Is there any other way to impute mode values for categorical variables in a data frame?

Upvotes: 0

Views: 1590

Answers (2)

TarJae

Reputation: 79184

Update: this version of the mode function works on a whole data frame:
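# Mode of a vector. table() already drops NA, so na.rm is accepted only
# to keep the calls below unchanged.
my_mode <- function (x, na.rm = TRUE) {
  xtab <- table(x)
  xmode <- names(which(xtab == max(xtab)))
  # ties are flagged with a placeholder rather than resolved
  if (length(xmode) > 1) xmode <- ">1 mode"
  return(xmode)
}

for (var in seq_len(ncol(cat_df))) {
  if (is.numeric(cat_df[[var]])) {
    # numeric columns: fill with the mean
    cat_df[is.na(cat_df[[var]]), var] <- mean(cat_df[[var]], na.rm = TRUE)
  } else if (is.character(cat_df[[var]]) || is.factor(cat_df[[var]])) {
    # categorical columns: fill with the mode
    cat_df[is.na(cat_df[[var]]), var] <- my_mode(cat_df[[var]], na.rm = TRUE)
  }
}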
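For a data frame that holds only categorical columns, as in the question, the same idea can be sketched with lapply. The example data and the helper name first_mode are made up for illustration (the question's cat_df is not shown), and ties are resolved by taking the value that appears first rather than by a placeholder.

```r
# Hypothetical example data standing in for the question's cat_df.
cat_df <- data.frame(
  colour = c("red", "blue", NA, "red"),
  size   = c("S", NA, "M", "M"),
  stringsAsFactors = FALSE
)

# Most frequent non-missing value; which.max() returns the first maximum,
# so ties go to the value seen first.
first_mode <- function(x) {
  values <- unique(x[!is.na(x)])
  values[which.max(tabulate(match(x, values)))]
}

# Replace the NAs in each column with that column's mode.
# Assigning into cat_df[] keeps the object a data frame.
cat_df[] <- lapply(cat_df, function(col) {
  col[is.na(col)] <- first_mode(col)
  col
})

cat_df
# colour becomes red, blue, red, red; size becomes S, M, M, M
```

This sketch assumes every column has at least one non-missing value; a column that is entirely NA would need separate handling.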

This version works on a single vector. Try it and please let me know.

# distinct non-missing values in the vector
values <- unique(cat_column[!is.na(cat_column)])
# mode of cat_column
themode <- values[which.max(tabulate(match(cat_column, values)))]
# copy the vector and fill its missing entries with the mode
imputevector <- cat_column
imputevector[is.na(imputevector)] <- themode
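As a worked example of the vector approach, here is a minimal sketch run on a made-up cat_column (the values are hypothetical, not the asker's data). The distinct values are taken from the non-missing entries only, and because which.max() returns the first maximum, a tie would go to the value that appears first.

```r
# Hypothetical example data.
cat_column <- c("red", "blue", NA, "red", NA)

# Distinct non-missing values: "red" "blue"
values <- unique(cat_column[!is.na(cat_column)])

# match() maps each entry to its position in values (NA stays NA),
# tabulate() counts those positions and ignores the NAs.
counts <- tabulate(match(cat_column, values))   # 2 1
themode <- values[which.max(counts)]            # "red"

# Fill the missing entries with the mode.
imputevector <- cat_column
imputevector[is.na(imputevector)] <- themode
imputevector
# "red" "blue" "red" "red" "red"
```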

Upvotes: 1

the-mad-statter

Reputation: 8886

User Defined Function

Here is the mode function I use with an additional line to choose a single mode in the event there are actually multiple modes:

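my_mode <- function(x) {
  ux <- unique(x[!is.na(x)])      # distinct non-missing values, so NA is never the mode
  tab <- tabulate(match(x, ux))   # count of each distinct value
  mode <- ux[tab == max(tab)]     # every value tied for the highest count
  # with several modes, pick one at random; if/else keeps the value's type,
  # whereas ifelse() would drop attributes such as factor levels
  if (length(mode) > 1) sample(mode, 1) else mode
}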

# single mode
cat_col_1 <- c(1, 1, 2, NA)
cat_col_1
#> [1]  1  1  2 NA
cat_col_1[is.na(cat_col_1)] <- my_mode(cat_col_1)
cat_col_1
#> [1] 1 1 2 1

# random sample among multimodal
cat_col_2 <- c(1, 1, 2, 2, NA)
cat_col_2
#> [1]  1  1  2  2 NA
cat_col_2[is.na(cat_col_2)] <- my_mode(cat_col_2)
cat_col_2
#> [1] 1 1 2 2 2
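Because ties are broken at random, the imputed value can differ from run to run. A minimal sketch of making the result reproducible with set.seed() follows; random_mode is a hypothetical stand-in for the function above, written out so the block runs on its own, and the seed value is arbitrary.

```r
# Stand-in for the mode function above (hypothetical name).
random_mode <- function(x) {
  ux <- unique(x[!is.na(x)])
  tab <- tabulate(match(x, ux))
  modes <- ux[tab == max(tab)]
  if (length(modes) > 1) sample(modes, 1) else modes
}

x <- c("a", "a", "b", "b", NA)   # two modes: "a" and "b"

set.seed(2021)                   # arbitrary seed, for illustration only
first_run <- random_mode(x)

set.seed(2021)                   # same seed, same random choice
second_run <- random_mode(x)

identical(first_run, second_run)
#> [1] TRUE
```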

DescTools::Mode()

Other people have written mode functions too. One option is Mode() in the DescTools package.

Because it returns every mode when there is more than one, you need to decide how to handle ties.
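Two deterministic ways to settle a tie are sketched below in base R, so the block does not depend on DescTools. The helper all_modes is hypothetical and only mimics the idea of returning every tied value; it is not the DescTools implementation.

```r
# Hypothetical helper: every non-missing value tied for the highest count,
# in order of first appearance.
all_modes <- function(x) {
  x <- x[!is.na(x)]
  ux <- unique(x)
  tab <- tabulate(match(x, ux))
  ux[tab == max(tab)]
}

x <- c("b", "a", "a", "b", NA)
modes <- all_modes(x)        # "b" "a"

first_seen <- modes[1]       # "b": the mode that appears first in the data
smallest   <- min(modes)     # "a": the mode that sorts first

# Impute with whichever rule was chosen, here the first-seen mode.
x[is.na(x)] <- first_seen
```

Either rule gives the same result on every run, at the cost of always favouring one category when counts are tied.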

Here is an example that randomly samples, with replacement, as many modes as there are missing values to replace.

# single mode
cat_col_3 <- c(1, 1, 2, NA)
cat_col_3
#> [1]  1  1  2 NA
cat_col_3_modes <- DescTools::Mode(cat_col_3, na.rm = TRUE)
cat_col_3_nmiss <- sum(is.na(cat_col_3))
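# index into the modes instead of passing them to sample() directly:
# given a single number n, sample() draws from 1:n rather than returning n
cat_col_3[is.na(cat_col_3)] <- cat_col_3_modes[sample.int(length(cat_col_3_modes), cat_col_3_nmiss, TRUE)]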
cat_col_3
#> [1] 1 1 2 1

# random sample among multimodal
cat_col_4 <- c(1, 1, 2, 2, NA, NA)
cat_col_4
#> [1]  1  1  2  2 NA NA
cat_col_4_modes <- DescTools::Mode(cat_col_4, na.rm = TRUE)
cat_col_4_nmiss <- sum(is.na(cat_col_4))
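# sample positions, so a lone numeric mode is never expanded to 1:n by sample()
cat_col_4[is.na(cat_col_4)] <- cat_col_4_modes[sample.int(length(cat_col_4_modes), cat_col_4_nmiss, TRUE)]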
cat_col_4
#> [1] 1 1 2 2 2 1

Created on 2021-04-16 by the reprex package (v1.0.0)

Upvotes: 0
