Reputation: 1887
I have the following table in R which lists a person race, gender, age, and cholesterol test. age and cholesterol test are displayed as dummy variables. age can be categorized as low, medium, or high, while cholesterol tests can be categorized as low or high. I want to transform the age and cholesterol columns to be single columns where low is categorized as 1, medium is categorized as 2, and high is categorized as 3. Cholesterol test can be neigh low or high if a person never took one and should be N/A in the expected output. I want the solution to be dynamic so that if I have multiple columns in this format, the code would still work (i.e. there may be some new tests, which can be categorized as high, low, or medium as dummy variables).
How can I do this in R?
input:
race gender age.low_tm1 age.medium_tm1 age.high_tm1 chol_test.low_tm1 chol_test.high_tm1
<chr> <int> <int> <int> <int> <int> <int>
1 white 0 1 0 0 0 0
2 white 0 1 0 0 0 0
3 white 1 1 0 0 0 0
4 black 1 0 1 0 0 0
5 white 0 0 0 1 0 1
6 black 0 0 1 0 1 0
expected output:
race gender age chol_test
1 white 0 1 n/a
2 white 0 1 n/a
3 white 1 1 n/a
4 black 1 2 n/a
5 white 0 3 3
6 black 0 2 1
Upvotes: 4
Views: 772
Reputation: 18541
We could first define a custom function that allows us to recode dummy variables based on their variable names, below called var_nm2value
.
This function takes the values of the variables as x
argument. In dplyr::across
this is the .x
part. And it takes a list of name-value pairs as value_ls
argument. The function just loops over the list of name-value pairs, checks if the name in value_ls
is found in the variable name. To do this it uses grepl
on dplyr::cur_column()
. If we have a match then we replace all 1
s with the value from our value_ls
and we return all other values, that is the zeros, as is.
Then we can define a list of recode values, below recode_ls
.
Finally, we use purrr::map_dfc
in a dplyr::summarise
where we use the variable strings we want to create "age"
and "chol_test"
, then ii) select
only columns which contain this string, and in each iteration we iii) apply dplyr::across
to recode the values, iv) pipe the result in a do.call
to get the max
and finally v) recode 0
s to NA
:
# custom function to recode 0/1 dummy variables based on their variable name an
var_nm2value <- function(x, values_ls) {
for (val in seq_along(values_ls)) {
if(grepl(names(values_ls)[val], dplyr::cur_column())) {
return(ifelse(x == 1L, values_ls[[val]], x))
}
}
}
# define list of recode values
recode_ls <- list(low = 1, medium = 2, high = 3)
library(tidyverse)
# apply functions to data.frame
df1 %>%
summarise(race = race,
gender = gender,
map_dfc(set_names(c("age", "chol_test")), # i)
function(x) {
select(., contains(x)) %>% # ii)
summarise("{x}" := across(everything(), var_nm2value, recode_ls) %>% # iii)
do.call("pmax", .) %>% # iv)
ifelse(. == 0, NA, .))} # v)
))
#> race gender age chol_test
#> 1 white 0 1 NA
#> 2 white 0 1 NA
#> 3 white 1 1 NA
#> 4 black 1 2 NA
#> 5 white 0 3 3
#> 6 black 0 2 1
Created on 2022-01-03 by the reprex package (v0.3.0)
Upvotes: 0
Reputation: 886938
Perhaps this helps
library(dplyr)
library(tidyr)
library(stringr)
df1 %>%
mutate(across(contains("_"), ~
. * setNames(1:3, c("low", "medium", "high"))[
str_extract(cur_column(), "low|medium|high")])) %>%
rename_with(~ str_remove(., "_tm1")) %>%
pivot_longer(cols = -c(race, gender),
names_to = c(".value", "categ"), names_sep = "\\.") %>%
filter(age > 0|chol_test > 0) %>%
select(-categ) %>%
mutate(chol_test = na_if(chol_test, 0))
-output
# A tibble: 7 × 4
race gender age chol_test
<chr> <int> <int> <int>
1 white 0 1 NA
2 white 0 1 NA
3 white 1 1 NA
4 black 1 2 NA
5 white 0 3 3
6 black 0 0 1
7 black 0 2 NA
df1 <- structure(list(race = c("white", "white", "white", "black", "white",
"black"), gender = c(0L, 0L, 1L, 1L, 0L, 0L), age.low_tm1 = c(1L,
1L, 1L, 0L, 0L, 0L), age.medium_tm1 = c(0L, 0L, 0L, 1L, 0L, 1L
), age.high_tm1 = c(0L, 0L, 0L, 0L, 1L, 0L), chol_test.low_tm1 = c(0L,
0L, 0L, 0L, 0L, 1L), chol_test.high_tm1 = c(0L, 0L, 0L, 0L, 1L,
0L)), class = "data.frame", row.names = c("1", "2", "3", "4",
"5", "6"))
Upvotes: 0