Julia
Julia

Reputation: 23

Filter rows where any values in a vector are contained in a column

I have a dataset with a single column that contains multiple ICD-10 codes separate by spaces, eg

Identifier    Codes
1             A14 R17
2             R069 D136 B08
3             C11 K71 V91

I have a vector with the ICD-10 codes that are relevant to my analysis, eg goodcodes<-c("C11","A14","R17","O80"). I want to select rows from my dataset where the Codes column contains any of the codes in my vector, but does not need to exactly match a code in my vector.

Using medicalinfo<-filter(medicalinfo, Codes %in% goodcodes) returns only rows where a single matching code is listed in the Codes column. I could also filter based on a partial string, I only know how to do that for a single partial string, not all of those in my codes vector.

Is there a way to get all the rows where any of these codes are present in the column?

Upvotes: 2

Views: 553

Answers (2)

r2evans
r2evans

Reputation: 160417

One trick is to combine the goodcodes into a regular expression:

library(dplyr)
ptn <- paste0("\\b(", paste(goodcodes, collapse = "|"), ")\\b")
ptn
# [1] "\\b(C11|A14|R17|O80)\\b"

FYI, the \\b( and )\\b are absolutely necessary if there's a chance that you will have codes A10 and A101; without \\b(...)\\b, then grepl("A10", "A101") will be a false-positive. See

grepl("A10|B20", "A101")
# [1] TRUE
grepl("\\b(A10|B20)\\b", "A101")
# [1] FALSE

Finally, let's use that ptn:

dat %>%
  filter(grepl(ptn, Codes))
#   Identifier         Codes
# 1          1       A14 R17
# 2          3   C11 K71 V91

Another way is to split the Codes column into a list of individual codes, and look for membership with %in%:

sapply(strsplit(trimws(dat$Codes), "\\s+"), function(a) any(a %in% goodcodes))
# [1]  TRUE FALSE  TRUE

Depending on how complex things are, a third way is to "unnest" Codes and look for matches.

dat %>%
  mutate(Codes = strsplit(trimws(Codes), "\\s+")) %>%
  tidyr::unnest(Codes) %>%
  group_by(Identifier) %>%
  filter(any(Codes %in% goodcodes)) %>%
  ungroup()
# # A tibble: 5 x 2
#   Identifier Codes
#        <dbl> <chr>
# 1          1 A14  
# 2          1 R17  
# 3          3 C11  
# 4          3 K71  
# 5          3 V91  

(If you really prefer them combined into a single space-delimited string as before, that's easy enough to do with group_by(Identifier) %>% summarize(Codes = paste(Codes, collapse = " ")). I don't recommend it, per se, since I prefer to have that type of information broken out like this, but there is likely context I don't know.)

Upvotes: 1

akrun
akrun

Reputation: 887008

With subset from base R. Loop over the 'goodcodes' vector, use that as pattern in grepl, Reduce the list of logical vectors into a single logical vector to subset the rows

subset(dat,  Reduce(`|`, lapply(goodcodes, function(x) grepl(x, Codes))))
#   Identifier       Codes
#1          1     A14 R17
#3          3 C11 K71 V91

data

dat <- structure(list(Identifier = 1:3, Codes = c("A14 R17", "R069 D136 B08", 
"C11 K71 V91")), class = "data.frame", row.names = c(NA, -3L))

Upvotes: 0

Related Questions