Reputation: 21
I have a dataset with a very large number of string column variables representing procedural codes. There is another column of variables that represents the coding format (some are ICD9, some are other more arcane formats). Each observation is a patient. I need to:
There are too many variables for it to be trivial to create a list of each of them manually via c("cd1", "cd2", ...), and I potentially need to do this many different times so I'd like to make it as general a solution as possible.
Here's a simplified sample, where the strings I need to search for begin with "231":
ID cd1 type1 cd2 type2 cd3 type3 cd4 type4
1 "231" "02" "219" "02" "1321" "02" "2314" "02"
2 "222" "02"
3 "123" "142"
4 "145" "02" "521" "02" "2313" "02"
5 "231" "01"
The output I'd like is:
ID cd1 type1 cd2 type2 cd3 type3 cd4 type4 flag_var
1 "231" "02" "219" "02" "1321" "02" "2314" "02" 1
2 "222" "02" 0
3 "123" "142" 0
4 "145" "02" "521" "02" "2313" "02" 1
5 "231" "01" 0
(ID #5 is set to 0 because, even though the cd1 code is "231", the type1 variable is "01" and thus not in the correct coding format")
I've been somewhat successful in accomplishing this using mutate and case_when:
df <- df %>%
mutate(flag_var = case_when(
startsWith(cd1, "231") ~ 1,
startsWith(cd2, "231") ~ 1,
startsWith(cd3, "231") ~ 1,
startsWith(cd4, "231") ~ 1,
TRUE ~ 0))
Like I said, the actual dataset has too many variables and potential searches for it to make sense to hard-code it in the manner above. I think there should be a good way to do this using mutate_at or another dplyr function, but I haven't been able to figure it out.
I've been able to use the following code to get a set of new variables equal to 1 or 0, but not a single variable. I can then use rowSums to sum across all columns and check if this value is nonzero. But it's ugly and laborious:
df <- df %>% mutate_at(vars(starts_with("cd")),
funs(testvar = ifelse(startsWith(., "231"), 1, 0)))
test_names = df %>% select(ends_with("_testvar")) %>% names()
df <- df %>% mutate(flag_var = (rowSums(.[test_names]) == 1))
df <- df %>% select(-ends_with("_testvar"))
Anyone have any simpler ideas? Thanks so much!
EDIT: I realized that I also have to incorporate coding type variables. The initial sample data tables have been edited to reflect this.
Upvotes: 2
Views: 2320
Reputation: 17289
Here is another solution:
df$flag_var <- 1*(rowSums(substring(as.matrix(df[, 2:ncol(df)]), 1, 3) == '231') > 0)
# ID cd1 cd2 cd3 cd4 flag_var
# 1 1 231 219 1321 2314 1
# 2 2 222 0
# 3 3 123 142 0
# 4 4 145 521 2313 122 1
For the updated example, since the cd
columns and type
columns are paired, the following code should work:
cd.cols <- grepl('^cd', colnames(df))
type.cols <- grepl('^type', colnames(df))
flag <- substring(as.matrix(df[,cd.cols]), 1, 3) == '231' & df[,type.cols] == '02'
df$flag_var <- 1 * (rowSums(flag) > 0)
# > df
# ID cd1 type1 cd2 type2 cd3 type3 cd4 type4 flag_var
# 1 1 231 02 219 02 1321 02 2314 02 1
# 2 2 222 02 0
# 3 3 123 142 0
# 4 4 145 02 521 02 2313 02 1
# 5 5 231 01 0
Upvotes: 1
Reputation: 886938
We could loop through the columns and do a partial match with grepl
, Reduce
the list
of vector
s to a single logical vector
and coerce the values to binary
df$flag_var <- +(Reduce(`|`, lapply(df[-1], grepl, pattern = '^231')))
df <- structure(list(ID = c("1", "2", "3", "4"), cd1 = c("231", "222",
"123", "145"), cd2 = c("219", "", "142", "521"), cd3 = c("1321",
"", "", "2313"), cd4 = c("2314", "", "", "122")), row.names = c(NA,
-4L), class = "data.frame")
Upvotes: 1
Reputation: 2101
This might answer question or do you need the 0-1 as row values?
library(tidyverse)
dat <- tribble(~ID, ~cd1, ~cd2, ~cd3, ~cd4,
1, "231", "219", "1321", "2314",
2, "222", "" , "" , "",
3, "123", "142", "" , "",
4, "145", "521", "2313", "122")
dat %>%
gather("cd_type", "code", 2:5) %>%
mutate(flag_var = case_when(
startsWith(code, "231") ~ 1,
TRUE ~ 0
))
#> # A tibble: 16 x 4
#> ID cd_type code flag_var
#> <dbl> <chr> <chr> <dbl>
#> 1 1 cd1 231 1
#> 2 2 cd1 222 0
#> 3 3 cd1 123 0
#> 4 4 cd1 145 0
#> 5 1 cd2 219 0
#> 6 2 cd2 "" 0
#> 7 3 cd2 142 0
#> 8 4 cd2 521 0
#> 9 1 cd3 1321 0
#> 10 2 cd3 "" 0
#> 11 3 cd3 "" 0
#> 12 4 cd3 2313 1
#> 13 1 cd4 2314 1
#> 14 2 cd4 "" 0
#> 15 3 cd4 "" 0
#> 16 4 cd4 122 0
Or do this to get back into original wide format
dat %>%
gather("cd_type", "code", 2:5) %>%
mutate(flag_var = case_when(
startsWith(code, "231") ~ 1,
TRUE ~ 0
)) %>%
spread(cd_type, code) %>%
select(ID, cd1:cd4, flag_var)
#> # A tibble: 6 x 6
#> ID cd1 cd2 cd3 cd4 flag_var
#> <dbl> <chr> <chr> <chr> <chr> <dbl>
#> 1 1 <NA> 219 1321 <NA> 0
#> 2 1 231 <NA> <NA> 2314 1
#> 3 2 222 "" "" "" 0
#> 4 3 123 142 "" "" 0
#> 5 4 145 521 <NA> 122 0
#> 6 4 <NA> <NA> 2313 <NA> 1
Created on 2019-01-19 by the reprex package (v0.2.1)
Upvotes: 1