Reputation: 11
Edited to add code:
I am trying to replicate some work from a colleague that uses SAS. We're having an issue with the import in SAS which converts text (which matches boolean) to numeric.
The purpose of this work is to identify particular records to pass on, so we need the values to be preserved as originally imported (something I think R will be able to do). Right now we're fixing the issue manually because it's a small number of records but that may not always be true.
Where I'm hitting a snag is that I need to replicate their matrix array in R. There are multiple conditions that should be flagged with a 1 if they meet the condition, as follows: SAS Code
I need to be able to evaluate if there is one of 34 potential strings (or partial strings (in SAS, the colon shortens a comparison value to the same length as the evaluation value and compares them) in one of 12 columns (e.g. :Q16 means the string only need start with Q16). Additionally, any one of the 12 could have a value through it does get sparser in later fields.
I am trying to find the most efficient and compact approach, if possible.
I'm still somewhat new at R for more complex problems so I am stymied. I've tried a few approaches with grep and grepl but none have born any fruit. When I tried regex, I tried using each string individually in ifelse and then I also tried one larger string with the "|" operator but no luck either. I also tried base (apply) and dplyr approaches.
Any help is appreciated.
The structure of the data is: Example Table
Code for Example Data:
structure(list(record = 1:20,
icd1 = c("Q753", "Q620", "Q825", "Q211", "Q828", "Q6532", "Q673", "Q380", "Q5310", "Q040", "Q107", "Q6689", "Q860", "Q753", "Q000", "Q673", "Q860", "Q673", "H9190", "Q381"),
icd2 = c("Q141",NA,NA, "Q170", NA, NA, NA, NA, NA, NA, NA, "Q211", NA, NA, "Q211", "Q673", NA, "115", "Q759", "Q753"),
icd3 = c("Q579", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "Q038", "H4657", "Q211"),
icd4 = c("Q656", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "Q999", NA, NA),
icd5 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "Q5301", NA, NA),
icd6 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "Q168", NA, NA),
icd7 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
icd8 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
icd9 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
icd10 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
icd11 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
icd12 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)),
.Names = c("record", "icd1", "icd2", "icd3", "icd4", "icd5", "icd6", "icd7", "icd8", "icd9", "icd10", "icd11", "icd12"),
class = "data.frame", row.names = c(NA, -20L))
Strings of Interest:
case2 <- "^H4703| ^H90*| ^H91*| ^Q000| ^Q001| ^Q002| ^Q01*| ^Q02| ^Q03*|
^Q04*| ^Q05*| ^Q070*| ^Q110| ^Q111| ^Q112| ^Q120| ^Q122| ^Q130| ^Q138|
^Q139| ^Q141| ^Q142| ^Q143| ^Q148| ^Q149| ^Q16*| ^Q65*| ^Q66*| ^Q674|
^Q688| ^Q743| ^Q758| ^Q759| ^Q828"
Upvotes: 1
Views: 202
Reputation: 3440
It looks like you are reinventing comorbidity assignment, for which there are several R packages. I created icd about six years ago for this purpose and it is widely used in the community. It is accurate and very fast. For your application, I would create a custom comorbidity map (just a named list, with each member being a vector of matching code stems).
library(icd)
# x <- your_data_structure_above
examplemap <- list(
case2 = c("H4703", "H90", "H91", "Q000", "Q001", "Q002", "Q01", "Q02",
"Q03", "Q04", "Q05", "Q070", "Q110", "Q111", "Q112", "Q120",
"Q122", "Q130", "Q138", "Q139", "Q141", "Q142", "Q143", "Q148",
"Q149", "Q16", "Q65", "Q66*", "Q674", "Q688", "Q743", "Q758",
"Q759", "Q828"))
icd::comorbid(x, examplemap)
Upvotes: 0
Reputation: 8110
For this, I modified your string a bit. In short, i converted your dataframe from wide to long, then I summarized each column as either having (TRUE) or not having (FALSE) any of the strings you wanted.
case2 <- "H4703|H90|H91|Q000|Q001|Q002|Q01|Q02|Q03|Q04|Q05|Q070|Q110|Q111|Q112|Q120|Q122|Q130|Q138|Q139|Q141|Q142|Q143|Q148|Q149|Q16|Q65|Q66|Q674|Q688|Q743|Q758|Q759|Q828"
library(dplyr)
library(tidyr)
df %>%
gather(column, string, -record) %>%
group_by(column) %>%
summarise(contains_string = sum(grepl(case2, string))>0)
#> # A tibble: 12 x 2
#> column contains_string
#> <chr> <lgl>
#> 1 icd1 TRUE
#> 2 icd10 FALSE
#> 3 icd11 FALSE
#> 4 icd12 FALSE
#> 5 icd2 TRUE
#> 6 icd3 TRUE
#> 7 icd4 TRUE
#> 8 icd5 FALSE
#> 9 icd6 TRUE
#> 10 icd7 FALSE
#> 11 icd8 FALSE
#> 12 icd9 FALSE
EDIT
Here is for each row.
df %>%
gather(column, string, -record) %>%
group_by(record) %>%
mutate(contains_string = sum(grepl(case2, string))>0) %>%
spread(column, string)
#> # A tibble: 20 x 14
#> # Groups: record [20]
#> record contains_string icd1 icd10 icd11 icd12 icd2 icd3 icd4 icd5
#> <int> <lgl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 1 TRUE Q753 <NA> <NA> <NA> Q141 Q579 Q656 <NA>
#> 2 2 FALSE Q620 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#> 3 3 FALSE Q825 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#> 4 4 FALSE Q211 <NA> <NA> <NA> Q170 <NA> <NA> <NA>
#> 5 5 TRUE Q828 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#> 6 6 TRUE Q6532 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#> 7 7 FALSE Q673 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#> 8 8 FALSE Q380 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#> 9 9 FALSE Q5310 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#> 10 10 TRUE Q040 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#> 11 11 FALSE Q107 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#> 12 12 TRUE Q6689 <NA> <NA> <NA> Q211 <NA> <NA> <NA>
#> 13 13 FALSE Q860 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#> 14 14 FALSE Q753 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#> 15 15 TRUE Q000 <NA> <NA> <NA> Q211 <NA> <NA> <NA>
#> 16 16 FALSE Q673 <NA> <NA> <NA> Q673 <NA> <NA> <NA>
#> 17 17 FALSE Q860 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#> 18 18 TRUE Q673 <NA> <NA> <NA> 115 Q038 Q999 Q5301
#> 19 19 TRUE H9190 <NA> <NA> <NA> Q759 H4657 <NA> <NA>
#> 20 20 FALSE Q381 <NA> <NA> <NA> Q753 Q211 <NA> <NA>
#> # ... with 4 more variables: icd6 <chr>, icd7 <chr>, icd8 <chr>,
#> # icd9 <chr>
Created on 2018-09-19 by the reprex package (v0.2.0).
Upvotes: 1
Reputation: 66425
Here's a similar approach as @AndS, using pmatch
to get partial matches.
1) Define lookup strings in vector:
case2 <- c("H4703", "H90", "H91", "Q000", "Q001", "Q002", "Q01", "Q02", "Q03",
"Q04", "Q05", "Q070", "Q110", "Q111", "Q112", "Q120", "Q122", "Q130", "Q138",
"Q139", "Q141", "Q142", "Q143", "Q148", "Q149", "Q16", "Q65", "Q66", "Q674",
"Q688", "Q743", "Q758", "Q759", "Q828")
2) Make data long and use pmatch
to identify and label partial matches.
library(tidyverse)
my_data2 <- my_data %>%
gather(question, value, -record) %>%
mutate(first_match = case2[pmatch(value, case2, duplicates.ok = TRUE)])
> head(my_data2)
record question value first_match
1 1 icd1 Q753 <NA>
2 2 icd1 Q620 <NA>
3 3 icd1 Q825 <NA>
4 4 icd1 Q211 <NA>
5 5 icd1 Q828 Q828
6 6 icd1 Q6532 <NA>
3) Finally, identify which rows have at least one match:
my_data2 %>%
group_by(record) %>%
summarize(contains_string = any(!is.na(first_match)))
# A tibble: 20 x 2
record contains_string
<int> <lgl>
1 1 TRUE
2 2 FALSE
3 3 FALSE
4 4 FALSE
5 5 TRUE
6 6 FALSE
7 7 FALSE
8 8 FALSE
9 9 FALSE
10 10 FALSE
11 11 FALSE
12 12 FALSE
13 13 FALSE
14 14 FALSE
15 15 TRUE
16 16 FALSE
17 17 FALSE
18 18 FALSE
19 19 TRUE
20 20 FALSE
Upvotes: 0