Eisen
Eisen

Reputation: 1887

Transforming dummy variables to single column in R

I have the following table in R which lists a person race, gender, age, and cholesterol test. age and cholesterol test are displayed as dummy variables. age can be categorized as low, medium, or high, while cholesterol tests can be categorized as low or high. I want to transform the age and cholesterol columns to be single columns where low is categorized as 1, medium is categorized as 2, and high is categorized as 3. Cholesterol test can be neigh low or high if a person never took one and should be N/A in the expected output. I want the solution to be dynamic so that if I have multiple columns in this format, the code would still work (i.e. there may be some new tests, which can be categorized as high, low, or medium as dummy variables).

How can I do this in R?

input:

  race  gender age.low_tm1 age.medium_tm1 age.high_tm1 chol_test.low_tm1 chol_test.high_tm1
  <chr>  <int>       <int>          <int>        <int>             <int>              <int>
1 white      0           1              0            0                 0                  0
2 white      0           1              0            0                 0                  0
3 white      1           1              0            0                 0                  0
4 black      1           0              1            0                 0                  0
5 white      0           0              0            1                 0                  1
6 black      0           0              1            0                 1                  0

expected output:

  race  gender   age  chol_test
1 white      0     1        n/a  
2 white      0     1        n/a
3 white      1     1        n/a
4 black      1     2        n/a
5 white      0     3          3
6 black      0     2          1

Upvotes: 4

Views: 772

Answers (2)

TimTeaFan
TimTeaFan

Reputation: 18541

We could first define a custom function that allows us to recode dummy variables based on their variable names, below called var_nm2value.

This function takes the values of the variables as x argument. In dplyr::across this is the .x part. And it takes a list of name-value pairs as value_ls argument. The function just loops over the list of name-value pairs, checks if the name in value_ls is found in the variable name. To do this it uses grepl on dplyr::cur_column(). If we have a match then we replace all 1s with the value from our value_ls and we return all other values, that is the zeros, as is.

Then we can define a list of recode values, below recode_ls.

Finally, we use purrr::map_dfc in a dplyr::summarise where we use the variable strings we want to create "age" and "chol_test", then ii) select only columns which contain this string, and in each iteration we iii) apply dplyr::across to recode the values, iv) pipe the result in a do.call to get the max and finally v) recode 0s to NA:

# custom function to recode  0/1 dummy variables based on their variable name an 
var_nm2value <- function(x, values_ls) {
  for (val in seq_along(values_ls)) {
    if(grepl(names(values_ls)[val], dplyr::cur_column())) {
      return(ifelse(x == 1L, values_ls[[val]], x))
    } 
  }
}

# define list of recode values
recode_ls <- list(low = 1, medium = 2, high = 3)

library(tidyverse)

# apply functions to data.frame
df1 %>% 
  summarise(race = race,
            gender = gender,
            map_dfc(set_names(c("age", "chol_test")), # i)
                    function(x) { 
                      select(., contains(x)) %>% # ii)
                        summarise("{x}" := across(everything(), var_nm2value, recode_ls) %>% # iii)
                                    do.call("pmax", .) %>% # iv) 
                                    ifelse(. == 0, NA, .))} # v)
            )) 

#>    race gender age chol_test
#> 1 white      0   1        NA
#> 2 white      0   1        NA
#> 3 white      1   1        NA
#> 4 black      1   2        NA
#> 5 white      0   3         3
#> 6 black      0   2         1

Created on 2022-01-03 by the reprex package (v0.3.0)

Upvotes: 0

akrun
akrun

Reputation: 886938

Perhaps this helps

library(dplyr)
library(tidyr)
library(stringr)
df1 %>% 
   mutate(across(contains("_"),  ~  
   . * setNames(1:3, c("low", "medium", "high"))[
     str_extract(cur_column(), "low|medium|high")]))   %>%    
  rename_with(~ str_remove(., "_tm1")) %>% 
  pivot_longer(cols = -c(race, gender), 
    names_to = c(".value", "categ"), names_sep = "\\.") %>% 
  filter(age > 0|chol_test > 0) %>% 
  select(-categ) %>% 
  mutate(chol_test = na_if(chol_test, 0))

-output

# A tibble: 7 × 4
  race  gender   age chol_test
  <chr>  <int> <int>     <int>
1 white      0     1        NA
2 white      0     1        NA
3 white      1     1        NA
4 black      1     2        NA
5 white      0     3         3
6 black      0     0         1
7 black      0     2        NA

data

df1 <- structure(list(race = c("white", "white", "white", "black", "white", 
"black"), gender = c(0L, 0L, 1L, 1L, 0L, 0L), age.low_tm1 = c(1L, 
1L, 1L, 0L, 0L, 0L), age.medium_tm1 = c(0L, 0L, 0L, 1L, 0L, 1L
), age.high_tm1 = c(0L, 0L, 0L, 0L, 1L, 0L), chol_test.low_tm1 = c(0L, 
0L, 0L, 0L, 0L, 1L), chol_test.high_tm1 = c(0L, 0L, 0L, 0L, 1L, 
0L)), class = "data.frame", row.names = c("1", "2", "3", "4", 
"5", "6"))

Upvotes: 0

Related Questions