datadigger

Reputation: 181

Replace missing values in all categorical variables with 'None'

I want to convert the missing values of all the categorical variables in my data set to 'None'. I have more than 100 factor variables and I want to do this for all of them at once, without naming them in the code.

Suppose I have the following data set (just as an example) and I want to replace the NAs in all factor variables, such as "x" and "y" here, with 'None' as a level.

  x = data.frame(x = as.factor(c(1, 2, NA, 3)), y = as.factor(c(NA, NA, 4, 5)), z=c(1,0,2,NA) )

Upvotes: 1

Views: 2029

Answers (5)

Joris C.

Reputation: 6244

This can also be done neatly in base R using rapply:

## data
dat <- data.frame(x = as.factor(c(1, 2, NA, 3)), y = as.factor(c(NA, NA, 4, 5)), z=c(1,0,2,NA))  

## rapply
rapply(dat, function(col) {
      if(any(is.na(col))) {
        col_na <- addNA(col)
        levels(col_na) <- c(levels(col), "None")
        col_na
      } else col
    }, classes = "factor", how = "replace")
#>      x    y  z
#> 1    1 None  1
#> 2    2 None  0
#> 3 None    4  2
#> 4    3    5 NA
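To see why this works, it may help to trace the two steps on a single factor. The following is only a minimal sketch (the names `f` and `f_na` are illustrative): `addNA()` turns NA into an explicit level, appended last, and the `levels<-` assignment then renames that trailing level.

```r
f <- factor(c(1, 2, NA, 3))
levels(f)                             # NA is not among the levels yet

f_na <- addNA(f)                      # NA becomes an explicit level, appended last
levels(f_na)

levels(f_na) <- c(levels(f), "None")  # rename the trailing NA level to "None"
f_na
anyNA(f_na)                           # no missing values remain
```

Because the NA level always ends up last, pairing it with `c(levels(f), "None")` renames exactly that level and leaves the others untouched.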

Benchmarks

Below is a small benchmark against @tmfmnk's tidyverse approach:

library(dplyr)
library(forcats)

## function to create simulated data.frame
create_df <- function(df_size, prop_NA){
  v <- sample(1:100, df_size^2, replace = TRUE)
  v[sample(seq_len(df_size^2), prop_NA * df_size^2)] <- NA
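  ## apply() returns a character matrix here; stringsAsFactors = TRUE is needed
  ## in R >= 4.0.0 so that the columns actually become factors
  data.frame(apply(matrix(v, ncol = df_size), 1, as.factor), stringsAsFactors = TRUE)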
}

## rapply approach
replace_NA_rapply <- function(dat) {
  rapply(dat, function(col) {
        if(any(is.na(col))) {
          col_na <- addNA(col)
          levels(col_na) <- c(levels(col), "None")
          col_na
        } else col
      }, classes = "factor", how = "replace")
}

## tidyverse approach
replace_NA_tidy <- function(dat) { 
  mutate_if(dat, is.factor, ~ fct_explicit_na(., na_level = "None"))
}

## benchmark several data.frame sizes
bnch <- bench::press(
    df_size = c(10, 100, 1E3),
    prop_NA = c(0.05, 0.5),
    {
      dat <- create_df(df_size, prop_NA)
      bench::mark(
          rapply = replace_NA_rapply(dat),
          tidyverse = replace_NA_tidy(dat),
          min_iterations = 50
      )
    }
)

bnch
#> # A tibble: 12 x 8
#>    expression df_size prop_NA      min   median `itr/sec` mem_alloc
#>    <bch:expr>   <dbl>   <dbl> <bch:tm> <bch:tm>     <dbl> <bch:byt>
#>  1 rapply          10    0.05 118.13µs 132.45µs   7274.          0B
#>  2 tidyverse       10    0.05   1.67ms   1.82ms    529.      1.74MB
#>  3 rapply         100    0.05   3.35ms   3.79ms    256.      1.25MB
#>  4 tidyverse      100    0.05  18.93ms  20.36ms     48.2     1.34MB
#>  5 rapply        1000    0.05   58.3ms  77.28ms     12.3    41.91MB
#>  6 tidyverse     1000    0.05 260.24ms 291.22ms      3.40   52.88MB
#>  7 rapply          10    0.5   253.2µs 286.64µs   3097.          0B
#>  8 tidyverse       10    0.5    2.17ms   2.48ms    369.      1.91KB
#>  9 rapply         100    0.5    3.07ms   3.58ms    272.   1005.53KB
#> 10 tidyverse      100    0.5   18.59ms  19.58ms     50.7     1.12MB
#> 11 rapply        1000    0.5   57.73ms  76.63ms     12.4    41.83MB
#> 12 tidyverse     1000    0.5  249.06ms 319.42ms      3.08   54.53MB

Upvotes: 0

lebatsnok

Reputation: 6479

You can convert to character, replace, and convert back to factor, e.g. like this:

df <- data.frame(x = as.factor(c(1, 2, NA, 3)), y = as.factor(c(NA, NA, 4, 5)), z=c(1,0,2,NA) )

isf <- sapply(df, is.factor)  # check which columns are factors
# df[isf] (rather than df[, isf]) stays a data frame even with a single factor column
df[isf] <- lapply(df[isf], function(.){
  . <- as.character(.)  # convert to character
  .[is.na(.)] <- "None" # replace NA by "None"
  factor(.)             # return a factor
})

A shorter version of the working part:

df[isf] <- lapply(df[isf], function(.)
  factor(replace(as.character(.), is.na(.), "None"))
)

Another strategy (perhaps more elegant) is to first add "None" to the levels of each factor and then replace NA's by "None":

df <- data.frame(x = as.factor(c(1, 2, NA, 3)), y = as.factor(c(NA, NA, 4, 5)), z=c(1,0,2,NA) )

isf <-  sapply(df, is.factor)  # check which columns are factors
df[isf] <- lapply(df[isf], function(.){
  levels(.) <- c(levels(.), "None")  # add "None" as a valid level
  replace(., is.na(.), "None")       # then fill the NAs with it
})
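If this has to be applied to several data frames, the second strategy could be wrapped in a small function. This is only a sketch: the name `na_to_level` and its `label` argument are hypothetical, and `union()` is used instead of `c()` as an extra precaution in case the label already exists as a level.

```r
# Hypothetical helper: add `label` as a level to every factor column
# and use it to fill that column's NAs. Non-factor columns are untouched.
na_to_level <- function(df, label = "None") {
  isf <- vapply(df, is.factor, logical(1))
  df[isf] <- lapply(df[isf], function(f) {
    levels(f) <- union(levels(f), label)  # avoid duplicating an existing level
    replace(f, is.na(f), label)
  })
  df
}

df <- data.frame(x = as.factor(c(1, 2, NA, 3)),
                 y = as.factor(c(NA, NA, 4, 5)),
                 z = c(1, 0, 2, NA))
out <- na_to_level(df)
str(out)
```

The numeric column `z` keeps its NA, since only columns for which `is.factor()` is TRUE are modified.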

Upvotes: 1

Not_Dave

Reputation: 501

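Here is one more answer without any package: convert the data frame to a matrix, replace every NA with 'None', and convert it back to a data frame. It's fast as well. Note that as.matrix() turns every column into character, so the NA in the numeric column z is replaced too, and the columns no longer keep their original types.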

x=as.matrix(data.frame(x = as.factor(c(1, 2, NA, 3)), y = as.factor(c(NA, NA, 4, 5)), z=c(1,0,2,NA) ))

x[is.na(x)] = 'None'
as.data.frame(x)

Upvotes: 0

tmfmnk

Reputation: 40171

Another tidyverse possibility that preserves the factor class:

library(dplyr)
library(forcats)

x %>%
 mutate_if(is.factor, ~ fct_explicit_na(., na_level = "None"))

     x    y  z
1    1 None  1
2    2 None  0
3 None    4  2
4    3    5 NA
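In more recent releases, `mutate_if()` is superseded by `across()` and `fct_explicit_na()` is deprecated in favour of `fct_na_value_to_level()`. A rough equivalent, assuming dplyr >= 1.0.0 and forcats >= 1.0.0 are installed (the result name `res` is only illustrative), might look like this:

```r
library(dplyr)    # assumes dplyr >= 1.0.0 for across() and where()
library(forcats)  # assumes forcats >= 1.0.0 for fct_na_value_to_level()

x <- data.frame(x = as.factor(c(1, 2, NA, 3)),
                y = as.factor(c(NA, NA, 4, 5)),
                z = c(1, 0, 2, NA))

# Turn NA into an explicit "None" level in every factor column
res <- x %>%
  mutate(across(where(is.factor), ~ fct_na_value_to_level(.x, level = "None")))
res
```

As with the original call, the columns stay factors and the numeric column `z` is left alone.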

Upvotes: 2

eastclintw00d

Reputation: 2364

You can use replace_na from the tidyr package like so:


library(tidyr)               
library(dplyr)
x %>% 
  mutate(x = as.character(x), y = as.character(y)) %>% 
  replace_na(list(x = "None", y = "None"))

Note that you first have to convert the columns of interest to character so that they can hold the "None" string.
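With more than 100 factor columns, listing each name is impractical. One possible way to apply the same idea without naming columns, assuming dplyr >= 1.0.0 for `across()` and `where()`, is sketched below (`res` is an illustrative name). Unlike the factor-preserving answers, the affected columns end up as character.

```r
library(tidyr)
library(dplyr)   # assumes dplyr >= 1.0.0 for across() and where()

x <- data.frame(x = as.factor(c(1, 2, NA, 3)),
                y = as.factor(c(NA, NA, 4, 5)),
                z = c(1, 0, 2, NA))

# Convert each factor column to character, then fill its NAs with "None"
res <- x %>%
  mutate(across(where(is.factor), ~ replace_na(as.character(.x), "None")))
res
```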

Upvotes: 0
