Reputation: 181
I want to convert the missing values of all the categorical variables in my data set to 'None'. I have more than 100 factor variables and I want to do this for all of them at once, without using their names in the code.
Suppose I have the following data set (just as an example) and I want to replace the NAs in all the factor variables, like "x" and "y" here, with 'None' as a level.
x = data.frame(x = as.factor(c(1, 2, NA, 3)), y = as.factor(c(NA, NA, 4, 5)), z=c(1,0,2,NA) )
Upvotes: 1
Views: 2029
Reputation: 6244
This can also be done neatly in base R using rapply:
## data
dat <- data.frame(x = as.factor(c(1, 2, NA, 3)), y = as.factor(c(NA, NA, 4, 5)), z=c(1,0,2,NA))
## rapply
rapply(dat, function(col) {
  if(any(is.na(col))) {
    col_na <- addNA(col)
    levels(col_na) <- c(levels(col), "None")
    col_na
  } else col
}, classes = "factor", how = "replace")
#> x y z
#> 1 1 None 1
#> 2 2 None 0
#> 3 None 4 2
#> 4 3 5 NA
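As a quick sanity check of the rapply result above, the following sketch inspects the levels and confirms that the numeric column is left alone. The name `res` is only introduced here for illustration.

```r
## data (same as above)
dat <- data.frame(x = as.factor(c(1, 2, NA, 3)),
                  y = as.factor(c(NA, NA, 4, 5)),
                  z = c(1, 0, 2, NA))

## same rapply call, result stored in `res` (illustrative name)
res <- rapply(dat, function(col) {
  if (any(is.na(col))) {
    col_na <- addNA(col)                      # NA becomes the last level
    levels(col_na) <- c(levels(col), "None")  # rename that level
    col_na
  } else col
}, classes = "factor", how = "replace")

levels(res$x)    # "1" "2" "3" "None"
anyNA(res$y)     # FALSE: no missing values left in the factor
is.na(res$z[4])  # TRUE: the numeric column still has its NA
```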
Benchmarks
Below is a small benchmark against @tmfmk's tidyverse approach:
library(dplyr)
library(forcats)
## function to create simulated data.frame
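create_df <- function(df_size, prop_NA){
  v <- sample(1:100, df_size^2, replace = TRUE)
  v[sample(seq_len(df_size^2), prop_NA * df_size^2)] <- NA
  ## request factor columns explicitly: since R 4.0 strings are no longer
  ## converted to factors by default
  as.data.frame(matrix(as.character(v), ncol = df_size), stringsAsFactors = TRUE)
}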
## rapply approach
replace_NA_rapply <- function(dat) {
  rapply(dat, function(col) {
    if(any(is.na(col))) {
      col_na <- addNA(col)
      levels(col_na) <- c(levels(col), "None")
      col_na
    } else col
  }, classes = "factor", how = "replace")
}
## tidyverse approach
replace_NA_tidy <- function(dat) {
  mutate_if(dat, is.factor, ~ fct_explicit_na(., na_level = "None"))
}
## benchmark several data.frame sizes
bnch <- bench::press(
  df_size = c(10, 100, 1E3),
  prop_NA = c(0.05, 0.5),
  {
    dat <- create_df(df_size, prop_NA)
    bench::mark(
      rapply = replace_NA_rapply(dat),
      tidyverse = replace_NA_tidy(dat),
      min_iterations = 50
    )
  }
)
bnch
#> # A tibble: 12 x 8
#> expression df_size prop_NA min median `itr/sec` mem_alloc
#> <bch:expr> <dbl> <dbl> <bch:tm> <bch:tm> <dbl> <bch:byt>
#> 1 rapply 10 0.05 118.13µs 132.45µs 7274. 0B
#> 2 tidyverse 10 0.05 1.67ms 1.82ms 529. 1.74MB
#> 3 rapply 100 0.05 3.35ms 3.79ms 256. 1.25MB
#> 4 tidyverse 100 0.05 18.93ms 20.36ms 48.2 1.34MB
#> 5 rapply 1000 0.05 58.3ms 77.28ms 12.3 41.91MB
#> 6 tidyverse 1000 0.05 260.24ms 291.22ms 3.40 52.88MB
#> 7 rapply 10 0.5 253.2µs 286.64µs 3097. 0B
#> 8 tidyverse 10 0.5 2.17ms 2.48ms 369. 1.91KB
#> 9 rapply 100 0.5 3.07ms 3.58ms 272. 1005.53KB
#> 10 tidyverse 100 0.5 18.59ms 19.58ms 50.7 1.12MB
#> 11 rapply 1000 0.5 57.73ms 76.63ms 12.4 41.83MB
#> 12 tidyverse 1000 0.5 249.06ms 319.42ms 3.08 54.53MB
Upvotes: 0
Reputation: 6479
You can convert to character, replace, and convert back to factor, e.g. like this:
df <- data.frame(x = as.factor(c(1, 2, NA, 3)), y = as.factor(c(NA, NA, 4, 5)), z=c(1,0,2,NA) )
isf <- sapply(df, is.factor) # check which columns are factors
df[isf] <- lapply(df[isf], function(.){   # df[isf] also works when only one column is a factor
  . <- as.character(.)     # convert to character
  .[is.na(.)] <- "None"    # replace NA by "None"
  factor(.)                # return a factor
})
A shorter version of the working part:
df[isf] <- lapply(df[isf], function(.)
  factor(replace(as.character(.), is.na(.), "None"))
)
Another strategy (perhaps more elegant) is to first add "None" to the levels of each factor and then replace the NAs by "None":
df <- data.frame(x = as.factor(c(1, 2, NA, 3)), y = as.factor(c(NA, NA, 4, 5)), z=c(1,0,2,NA) )
isf <- sapply(df, is.factor) # check which columns are factors
df[isf] <- lapply(df[isf], function(.){
  levels(.) <- c(levels(.), "None")   # add "None" as a valid level first
  replace(., is.na(.), "None")        # then fill the NAs with it
})
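The "add the level, then replace" strategy can be wrapped in a small reusable function, which may be handy with 100+ factor columns. This is only a sketch: `na_to_level` and its `level` argument are hypothetical names, not part of base R.

```r
## Hypothetical helper: make NA an explicit level in every factor column
na_to_level <- function(df, level = "None") {
  isf <- vapply(df, is.factor, logical(1))   # which columns are factors
  df[isf] <- lapply(df[isf], function(f) {
    levels(f) <- union(levels(f), level)     # add the level if not yet present
    f[is.na(f)] <- level                     # fill the missing values
    f
  })
  df
}

dat <- data.frame(x = as.factor(c(1, 2, NA, 3)),
                  y = as.factor(c(NA, NA, 4, 5)),
                  z = c(1, 0, 2, NA))
out <- na_to_level(dat)
levels(out$x)   # "1" "2" "3" "None"
```

Non-factor columns such as z are not touched, so their NAs remain.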
Upvotes: 1
Reputation: 501
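Here is one more answer without any package: convert the data frame to a matrix, replace all NAs with 'None', and convert it back to a data frame. It is fast as well. Note that as.matrix turns every column into character, so the numeric column z is converted too and its NA is also replaced.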
x=as.matrix(data.frame(x = as.factor(c(1, 2, NA, 3)), y = as.factor(c(NA, NA, 4, 5)), z=c(1,0,2,NA) ))
x[is.na(x)] = 'None'
as.data.frame(x)
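A short sketch of what the matrix round trip does to the column types, using the example data (the names `dat`, `m` and `out` are illustrative only):

```r
dat <- data.frame(x = as.factor(c(1, 2, NA, 3)),
                  y = as.factor(c(NA, NA, 4, 5)),
                  z = c(1, 0, 2, NA))

m <- as.matrix(dat)        # mixed column types give a character matrix
m[is.na(m)] <- "None"      # replaces NA in every column, z included
out <- as.data.frame(m)

is.character(m)            # TRUE
is.numeric(out$z)          # FALSE: z is no longer numeric
as.character(out$z)[4]     # "None"
```

If the numeric columns must stay numeric, it is probably safer to restrict the replacement to the factor columns, as in the other answers.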
Upvotes: 0
Reputation: 40171
Another tidyverse possibility that preserves the factor class:
library(dplyr)
library(forcats)
x %>%
  mutate_if(is.factor, ~ fct_explicit_na(., na_level = "None"))
x y z
1 1 None 1
2 2 None 0
3 None 4 2
4 3 5 NA
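In recent package versions, mutate_if is superseded by across(), and fct_explicit_na is deprecated in favour of fct_na_value_to_level. A sketch of the same idea, assuming dplyr >= 1.0.0 and forcats >= 1.0.0 are installed (`dat` and `out` are illustrative names):

```r
library(dplyr)
library(forcats)

dat <- data.frame(x = as.factor(c(1, 2, NA, 3)),
                  y = as.factor(c(NA, NA, 4, 5)),
                  z = c(1, 0, 2, NA))

## turn NA into an explicit "None" level in every factor column
out <- dat %>%
  mutate(across(where(is.factor),
                ~ fct_na_value_to_level(.x, level = "None")))

as.character(out$y)   # "None" "None" "4" "5"
```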
Upvotes: 2
Reputation: 2364
You can use replace_na from the tidyr package like so:
library(tidyr)
library(dplyr)
x %>%
mutate(x = as.character(x), y = as.character(y)) %>%
replace_na(list(x = "None", y = "None"))
Note that you first have to convert the columns of interest to character so that they can hold the "None" string.
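Since the question asks for a solution that does not name the columns, the same replace_na idea can be applied to all factor columns at once. A sketch, assuming dplyr >= 1.0.0 (for across and where); `dat` and `out` are illustrative names, and the resulting columns are character, not factor:

```r
library(dplyr)
library(tidyr)

dat <- data.frame(x = as.factor(c(1, 2, NA, 3)),
                  y = as.factor(c(NA, NA, 4, 5)),
                  z = c(1, 0, 2, NA))

## convert each factor column to character, then fill its NAs
out <- dat %>%
  mutate(across(where(is.factor),
                ~ replace_na(as.character(.x), "None")))

out$y   # "None" "None" "4" "5"
```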
Upvotes: 0