Reputation: 49
I have a very large dataset where some of the variables are currently integers or doubles but should be factors. Since the observations in these columns are all either 0, 1, or NA, how do I convert all of them to factors in dplyr?
Upvotes: 0
Views: 1160
Reputation: 6489
Another dplyr approach to reach your goal. I used the built-in dataset mtcars because some of its columns (vs and am) are of type double but binary (0 and 1).
library(dplyr)

df <- mtcars %>%
  mutate(across(where(~ setequal(na.omit(.x), 0:1)), as.factor))
glimpse(df)
# Rows: 32
# Columns: 11
# $ mpg <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2,~
# $ cyl <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4,~
# $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140~
# $ hp <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 18~
# $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92,~
# $ wt <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.1~
# $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.~
# $ vs <fct> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1,~
# $ am <fct> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,~
# $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4,~
# $ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1,~
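One thing worth noting about the setequal() predicate above: it requires both 0 and 1 to occur in a column, so a column that happens to contain only 0s (or only 1s) besides NA is left unconverted. A minimal base R sketch of the difference, using a made-up vector x_only_zero that is not part of the original example:

```r
# Hypothetical column: only 0s and a missing value, no 1 at all
x_only_zero <- c(0, 0, NA)

# Predicate from the answer: the non-missing values must be exactly {0, 1}
strict <- setequal(na.omit(x_only_zero), 0:1)

# Looser alternative: every non-missing value must be 0 or 1
loose <- all(na.omit(x_only_zero) %in% 0:1)

strict  # FALSE, because 1 never occurs
loose   # TRUE
```

Which of the two you want depends on whether such one-valued columns should count as binary in your data.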
Upvotes: 0
Reputation: 18551
The canonical dplyr way would be to write a custom predicate function that returns TRUE or FALSE for each column depending on whether the conditions are met, and to use this function inside across(where(predicate_function), ...).
Below I borrow the example data from @Tob and add some variations (one column contains only 0 and 1 but is a double, one column contains NAs, and one is a numeric column that contains other values).
library(dplyr)
test_data <- tibble(strings = c("a", "b", "c", "d", "e"),
                    col_2 = c(1, 0, 0, 0, NA),
                    col_3 = as.double(c(0, 1, 1, 0, 1)),
                    col_4 = c(0L, 1L, 1L, 0L, 1L),
                    col_5 = 1:5)
# let's have a look at the data and the column types
test_data
#> # A tibble: 5 x 5
#> strings col_2 col_3 col_4 col_5
#> <chr> <dbl> <dbl> <int> <int>
#> 1 a 1 0 0 1
#> 2 b 0 1 1 2
#> 3 c 0 1 1 3
#> 4 d 0 0 0 4
#> 5 e NA 1 1 5
# predicate function
is_01_col <- function(x) {
  all(unique(x) %in% c(0, 1, NA))
}
test_data %>%
  mutate(across(where(is_01_col), as.factor)) %>%
  glimpse()
#> Rows: 5
#> Columns: 5
#> $ strings <chr> "a", "b", "c", "d", "e"
#> $ col_2 <fct> 1, 0, 0, 0, NA
#> $ col_3 <fct> 0, 1, 1, 0, 1
#> $ col_4 <fct> 0, 1, 1, 0, 1
#> $ col_5 <int> 1, 2, 3, 4, 5
Created on 2021-07-26 by the reprex package (v0.3.0)
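A caveat on the predicate: because %in% coerces types before matching, is_01_col also returns TRUE for logical columns, for character columns holding "0" and "1", and for columns that are entirely NA. If that is not wanted, a stricter variant could look like the sketch below. The name is_01_numeric and the demo tibble are my own assumptions for illustration, not part of the original answer.

```r
library(dplyr)

# Hypothetical stricter predicate: numeric columns only, with at least one
# non-missing value, and nothing other than 0, 1, or NA
is_01_numeric <- function(x) {
  is.numeric(x) && !all(is.na(x)) && all(x %in% c(0, 1, NA))
}

# Made-up data covering the edge cases mentioned above
demo <- tibble(
  flag_lgl = c(TRUE, FALSE, NA),   # logical: the looser check would accept this
  flag_chr = c("0", "1", "1"),     # character: the looser check would accept this
  flag_dbl = c(0, 1, NA),          # double with NA: should be converted
  flag_int = c(1L, 0L, 1L)         # integer: should be converted
)

converted <- demo %>%
  mutate(across(where(is_01_numeric), as.factor))

# Inspect the resulting column classes
sapply(converted, class)
```

Only flag_dbl and flag_int become factors here; the logical and character columns keep their types.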
Upvotes: 3
Reputation: 245
This is what I might do, but I don't know how fast it will be if your data is large.
library(dplyr)

# Create some data
test_data <- data.frame(strings = c("a", "b", "c", "d", "e"),
                        col_2 = c(1, 0, 0, 0, 1),
                        col_3 = c(0, 1, 1, 0, 1))
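# Find numeric columns whose non-missing values are exactly 0 and 1
# (setequal() also matches integer columns, which identical() against c(0, 1) would miss)
is_binary <- vapply(test_data, function(x) is.numeric(x) && setequal(x[!is.na(x)], c(0, 1)), logical(1))
cols_to_convert <- names(test_data)[is_binary]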
# Convert these columns to factors
new_data <- test_data %>%
  mutate(across(all_of(cols_to_convert), as.factor))
# Check that the columns are factors
lapply(new_data, class)
Upvotes: 1