pophealth
pophealth

Reputation: 11

Match multiple strings across multiple multiple columns and create a single yes/no (1/0) column

Edited to add code:

I am trying to replicate some work from a colleague that uses SAS. We're having an issue with the import in SAS which converts text (which matches boolean) to numeric.

The purpose of this work is to identify particular records to pass on, so we need the values to be preserved as originally imported (something I think R will be able to do). Right now we're fixing the issue manually because it's a small number of records but that may not always be true.

Where I'm hitting a snag is that I need to replicate their matrix array in R. There are multiple conditions that should be flagged with a 1 if they meet the condition, as follows: SAS Code

I need to be able to evaluate if there is one of 34 potential strings (or partial strings (in SAS, the colon shortens a comparison value to the same length as the evaluation value and compares them) in one of 12 columns (e.g. :Q16 means the string only need start with Q16). Additionally, any one of the 12 could have a value through it does get sparser in later fields.

I am trying to find the most efficient and compact approach, if possible.

I'm still somewhat new at R for more complex problems so I am stymied. I've tried a few approaches with grep and grepl but none have born any fruit. When I tried regex, I tried using each string individually in ifelse and then I also tried one larger string with the "|" operator but no luck either. I also tried base (apply) and dplyr approaches.

Any help is appreciated.

The structure of the data is: Example Table

Code for Example Data:

structure(list(record = 1:20, 
icd1 = c("Q753", "Q620", "Q825", "Q211", "Q828", "Q6532", "Q673", "Q380", "Q5310", "Q040", "Q107", "Q6689", "Q860", "Q753", "Q000", "Q673", "Q860", "Q673", "H9190", "Q381"), 
icd2 = c("Q141",NA,NA, "Q170", NA, NA, NA, NA, NA, NA, NA, "Q211", NA, NA, "Q211", "Q673", NA, "115", "Q759", "Q753"), 
icd3 = c("Q579", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "Q038", "H4657", "Q211"), 
icd4 = c("Q656", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "Q999", NA, NA), 
icd5 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "Q5301", NA, NA), 
icd6 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "Q168", NA, NA), 
icd7 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), 
icd8 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), 
icd9 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), 
icd10 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), 
icd11 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), 
icd12 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)), 
.Names = c("record", "icd1", "icd2", "icd3", "icd4", "icd5", "icd6", "icd7", "icd8", "icd9", "icd10", "icd11", "icd12"), 
class = "data.frame", row.names = c(NA, -20L))

Strings of Interest:

case2 <- "^H4703| ^H90*| ^H91*| ^Q000| ^Q001| ^Q002| ^Q01*| ^Q02| ^Q03*|
^Q04*| ^Q05*| ^Q070*| ^Q110| ^Q111| ^Q112| ^Q120| ^Q122| ^Q130| ^Q138|
^Q139| ^Q141| ^Q142| ^Q143| ^Q148| ^Q149| ^Q16*| ^Q65*| ^Q66*| ^Q674| 
^Q688| ^Q743| ^Q758| ^Q759| ^Q828" 

Upvotes: 1

Views: 202

Answers (3)

Jack Wasey
Jack Wasey

Reputation: 3440

It looks like you are reinventing comorbidity assignment, for which there are several R packages. I created icd about six years ago for this purpose and it is widely used in the community. It is accurate and very fast. For your application, I would create a custom comorbidity map (just a named list, with each member being a vector of matching code stems).

library(icd)
# x <- your_data_structure_above
examplemap <- list(
  case2 = c("H4703", "H90", "H91", "Q000", "Q001", "Q002", "Q01", "Q02", 
            "Q03", "Q04", "Q05", "Q070", "Q110", "Q111", "Q112", "Q120", 
            "Q122", "Q130", "Q138", "Q139", "Q141", "Q142", "Q143", "Q148", 
            "Q149", "Q16", "Q65", "Q66*", "Q674", "Q688", "Q743", "Q758", 
            "Q759", "Q828"))
icd::comorbid(x, examplemap)

Upvotes: 0

AndS.
AndS.

Reputation: 8110

For this, I modified your string a bit. In short, i converted your dataframe from wide to long, then I summarized each column as either having (TRUE) or not having (FALSE) any of the strings you wanted.

case2 <- "H4703|H90|H91|Q000|Q001|Q002|Q01|Q02|Q03|Q04|Q05|Q070|Q110|Q111|Q112|Q120|Q122|Q130|Q138|Q139|Q141|Q142|Q143|Q148|Q149|Q16|Q65|Q66|Q674|Q688|Q743|Q758|Q759|Q828" 

library(dplyr)
library(tidyr)

df %>% 
  gather(column, string, -record) %>%
  group_by(column) %>%
  summarise(contains_string = sum(grepl(case2, string))>0) 
#> # A tibble: 12 x 2
#>    column contains_string
#>    <chr>  <lgl>          
#>  1 icd1   TRUE           
#>  2 icd10  FALSE          
#>  3 icd11  FALSE          
#>  4 icd12  FALSE          
#>  5 icd2   TRUE           
#>  6 icd3   TRUE           
#>  7 icd4   TRUE           
#>  8 icd5   FALSE          
#>  9 icd6   TRUE           
#> 10 icd7   FALSE          
#> 11 icd8   FALSE          
#> 12 icd9   FALSE

EDIT

Here is for each row.

df %>% 
  gather(column, string, -record) %>%
  group_by(record) %>%
  mutate(contains_string = sum(grepl(case2, string))>0) %>%
  spread(column, string)
#> # A tibble: 20 x 14
#> # Groups:   record [20]
#>    record contains_string icd1  icd10 icd11 icd12 icd2  icd3  icd4  icd5 
#>     <int> <lgl>           <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#>  1      1 TRUE            Q753  <NA>  <NA>  <NA>  Q141  Q579  Q656  <NA> 
#>  2      2 FALSE           Q620  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
#>  3      3 FALSE           Q825  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
#>  4      4 FALSE           Q211  <NA>  <NA>  <NA>  Q170  <NA>  <NA>  <NA> 
#>  5      5 TRUE            Q828  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
#>  6      6 TRUE            Q6532 <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
#>  7      7 FALSE           Q673  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
#>  8      8 FALSE           Q380  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
#>  9      9 FALSE           Q5310 <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
#> 10     10 TRUE            Q040  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
#> 11     11 FALSE           Q107  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
#> 12     12 TRUE            Q6689 <NA>  <NA>  <NA>  Q211  <NA>  <NA>  <NA> 
#> 13     13 FALSE           Q860  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
#> 14     14 FALSE           Q753  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
#> 15     15 TRUE            Q000  <NA>  <NA>  <NA>  Q211  <NA>  <NA>  <NA> 
#> 16     16 FALSE           Q673  <NA>  <NA>  <NA>  Q673  <NA>  <NA>  <NA> 
#> 17     17 FALSE           Q860  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
#> 18     18 TRUE            Q673  <NA>  <NA>  <NA>  115   Q038  Q999  Q5301
#> 19     19 TRUE            H9190 <NA>  <NA>  <NA>  Q759  H4657 <NA>  <NA> 
#> 20     20 FALSE           Q381  <NA>  <NA>  <NA>  Q753  Q211  <NA>  <NA> 
#> # ... with 4 more variables: icd6 <chr>, icd7 <chr>, icd8 <chr>,
#> #   icd9 <chr>

Created on 2018-09-19 by the reprex package (v0.2.0).

Upvotes: 1

Jon Spring
Jon Spring

Reputation: 66425

Here's a similar approach as @AndS, using pmatch to get partial matches.

1) Define lookup strings in vector:

case2 <- c("H4703", "H90", "H91", "Q000", "Q001", "Q002", "Q01", "Q02", "Q03",
"Q04", "Q05", "Q070", "Q110", "Q111", "Q112", "Q120", "Q122", "Q130", "Q138",
"Q139", "Q141", "Q142", "Q143", "Q148", "Q149", "Q16", "Q65", "Q66", "Q674", 
"Q688", "Q743", "Q758", "Q759", "Q828")

2) Make data long and use pmatch to identify and label partial matches.

library(tidyverse)
my_data2 <- my_data %>%
  gather(question, value, -record) %>%
  mutate(first_match = case2[pmatch(value, case2, duplicates.ok = TRUE)])

> head(my_data2)
  record question value first_match
1      1     icd1  Q753        <NA>
2      2     icd1  Q620        <NA>
3      3     icd1  Q825        <NA>
4      4     icd1  Q211        <NA>
5      5     icd1  Q828        Q828
6      6     icd1 Q6532        <NA>

3) Finally, identify which rows have at least one match:

my_data2 %>%
  group_by(record) %>%
  summarize(contains_string = any(!is.na(first_match)))

# A tibble: 20 x 2
   record contains_string
    <int> <lgl>          
 1      1 TRUE           
 2      2 FALSE          
 3      3 FALSE          
 4      4 FALSE          
 5      5 TRUE           
 6      6 FALSE          
 7      7 FALSE          
 8      8 FALSE          
 9      9 FALSE          
10     10 FALSE          
11     11 FALSE          
12     12 FALSE          
13     13 FALSE          
14     14 FALSE          
15     15 TRUE           
16     16 FALSE          
17     17 FALSE          
18     18 FALSE          
19     19 TRUE           
20     20 FALSE

Upvotes: 0

Related Questions