sergio_ag

Reputation: 49

How do I automatically convert columns to factors if all their observations are 0 or 1?

I have a very large dataset in which some variables are currently integers or doubles but should be factors. The observations in these columns are all 0, 1, or NA. How do I convert all such columns to factors with dplyr?

Upvotes: 0

Views: 1160

Answers (3)

B. Christian Kamgang

Reputation: 6489

Another dplyr approach: use setequal() inside where() to select the columns whose non-missing values are exactly 0 and 1. I used the built-in dataset mtcars because two of its double columns (vs and am) are binary (0 and 1).

library(dplyr)

# Convert every column whose non-missing values are exactly 0 and 1
df <- mtcars %>%
  mutate(across(where( ~ setequal(na.omit(.x), 0:1)), as.factor))

glimpse(df)
# Rows: 32
# Columns: 11
# $ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2,~
# $ cyl  <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4,~
# $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140~
# $ hp   <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 18~
# $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92,~
# $ wt   <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.1~
# $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.~
# $ vs   <fct> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1,~
# $ am   <fct> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,~
# $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4,~
# $ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1,~
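
To see what the setequal() condition accepts, the predicate can be tried on a few small vectors. This is only a minimal sketch; the helper name is_binary is made up for illustration and is not part of dplyr.

```r
# Hypothetical helper wrapping the predicate used in the answer above
is_binary <- function(x) setequal(na.omit(x), 0:1)

is_binary(c(0, 1, NA))   # TRUE: NAs are dropped before comparing
is_binary(c(0L, 1L, 1L)) # TRUE: integer columns match as well
is_binary(c(0, 0, 0))    # FALSE: the value 1 never occurs
is_binary(c(0, 1, 2))    # FALSE: contains a value other than 0 and 1
is_binary(c("a", "b"))   # FALSE
```

Note that a column holding only 0s (or only 1s) is not selected, because setequal() requires both values to be present.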

Upvotes: 0

TimTeaFan

Reputation: 18551

The canonical dplyr way is to write a custom predicate function that returns TRUE or FALSE for each column, depending on whether the column meets the condition, and to use that function inside across(where(predicate_function), ...).

Below I borrow the example data from @Tob and add some variations: a 0/1 column that contains an NA, a 0/1 column stored as double, a 0/1 column stored as integer, and a numeric column that contains other values.

library(dplyr)

test_data <- tibble(strings = c("a", "b", "c", "d", "e"), 
                    col_2 = c(1, 0, 0, 0, NA), 
                    col_3 = as.double(c(0, 1, 1, 0, 1)),
                    col_4 = c(0L, 1L, 1L, 0L, 1L),
                    col_5 = 1:5)

# let's have a look at the data and the column types
test_data

#> # A tibble: 5 x 5
#>   strings col_2 col_3 col_4 col_5
#>   <chr>   <dbl> <dbl> <int> <int>
#> 1 a           1     0     0     1
#> 2 b           0     1     1     2
#> 3 c           0     1     1     3
#> 4 d           0     0     0     4
#> 5 e          NA     1     1     5

# predicate function: TRUE if every value in the column is 0, 1 or NA
is_01_col <- function(x) {
  all(unique(x) %in% c(0, 1, NA))
}

test_data %>% 
  mutate(across(where(is_01_col), as.factor)) %>%
  glimpse
#> Rows: 5
#> Columns: 5
#> $ strings <chr> "a", "b", "c", "d", "e"
#> $ col_2   <fct> 1, 0, 0, 0, NA
#> $ col_3   <fct> 0, 1, 1, 0, 1
#> $ col_4   <fct> 0, 1, 1, 0, 1
#> $ col_5   <int> 1, 2, 3, 4, 5

Created on 2021-07-26 by the reprex package (v0.3.0)
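
One thing to be aware of: is_01_col() also returns TRUE for logical columns and for columns that are entirely NA, because TRUE and FALSE compare equal to 1 and 0. If that is not wanted, a stricter variant might look like the sketch below. The name is_01_numeric, the example tibble demo, and the choice to set the factor levels explicitly are assumptions made for illustration, not part of the answer above.

```r
library(dplyr)

# Hypothetical stricter predicate: numeric columns only, at least one
# non-missing value, and every value equal to 0, 1 or NA
is_01_numeric <- function(x) {
  is.numeric(x) && !all(is.na(x)) && all(x %in% c(0, 1, NA))
}

demo <- tibble(
  flag  = c(1, 0, NA),                      # double 0/1 with an NA
  lgl   = c(TRUE, FALSE, TRUE),             # logical
  empty = c(NA_real_, NA_real_, NA_real_),  # all NA
  zeros = c(0L, 0L, 0L)                     # integer, only 0s observed
)

# Setting the levels explicitly gives every converted column both
# levels, even if only one of the two values occurs in the data
result <- demo %>%
  mutate(across(where(is_01_numeric), ~ factor(.x, levels = c(0, 1))))

sapply(result, class)
```

With this predicate, flag and zeros become factors, while lgl and empty keep their original types.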

Upvotes: 3

Tob

Reputation: 245

This is what I might do, but I don't know how fast it will be if your data is large.

library(dplyr)

# Create some data
test_data <- data.frame(strings = c("a", "b", "c", "d", "e"),
                        col_2 = c(1, 0, 0, 0, 1),
                        col_3 = c(0, 1, 1, 0, 1))

# Find numeric columns whose distinct non-missing values are exactly 0 and 1
# (sort() drops NAs; as.numeric() lets integer columns match as well)
is_01 <- vapply(test_data,
                function(x) is.numeric(x) && identical(as.numeric(sort(unique(x))), c(0, 1)),
                logical(1))
cols_to_convert <- names(test_data)[is_01]

# Convert these columns to factors
new_data <- test_data %>% mutate(across(all_of(cols_to_convert), as.factor))

# Check that the columns are factors
lapply(new_data, class)


Upvotes: 1
