Reputation: 23
I have a dataset with a single column that contains multiple ICD-10 codes separate by spaces, eg
Identifier Codes
1 A14 R17
2 R069 D136 B08
3 C11 K71 V91
I have a vector with the ICD-10 codes that are relevant to my analysis, eg goodcodes<-c("C11","A14","R17","O80")
. I want to select rows from my dataset where the Codes column contains any of the codes in my vector, but does not need to exactly match a code in my vector.
Using medicalinfo<-filter(medicalinfo, Codes %in% goodcodes)
returns only rows where a single matching code is listed in the Codes column. I could also filter based on a partial string, I only know how to do that for a single partial string, not all of those in my codes vector.
Is there a way to get all the rows where any of these codes are present in the column?
Upvotes: 2
Views: 553
Reputation: 160417
One trick is to combine the goodcodes
into a regular expression:
library(dplyr)
ptn <- paste0("\\b(", paste(goodcodes, collapse = "|"), ")\\b")
ptn
# [1] "\\b(C11|A14|R17|O80)\\b"
FYI, the \\b(
and )\\b
are absolutely necessary if there's a chance that you will have codes A10
and A101
; without \\b(...)\\b
, then grepl("A10", "A101")
will be a false-positive. See
grepl("A10|B20", "A101")
# [1] TRUE
grepl("\\b(A10|B20)\\b", "A101")
# [1] FALSE
Finally, let's use that ptn
:
dat %>%
filter(grepl(ptn, Codes))
# Identifier Codes
# 1 1 A14 R17
# 2 3 C11 K71 V91
Another way is to split the Codes
column into a list of individual codes, and look for membership with %in%
:
sapply(strsplit(trimws(dat$Codes), "\\s+"), function(a) any(a %in% goodcodes))
# [1] TRUE FALSE TRUE
Depending on how complex things are, a third way is to "unnest" Codes
and look for matches.
dat %>%
mutate(Codes = strsplit(trimws(Codes), "\\s+")) %>%
tidyr::unnest(Codes) %>%
group_by(Identifier) %>%
filter(any(Codes %in% goodcodes)) %>%
ungroup()
# # A tibble: 5 x 2
# Identifier Codes
# <dbl> <chr>
# 1 1 A14
# 2 1 R17
# 3 3 C11
# 4 3 K71
# 5 3 V91
(If you really prefer them combined into a single space-delimited string as before, that's easy enough to do with group_by(Identifier) %>% summarize(Codes = paste(Codes, collapse = " "))
. I don't recommend it, per se, since I prefer to have that type of information broken out like this, but there is likely context I don't know.)
Upvotes: 1
Reputation: 887008
With subset
from base R
. Loop over the 'goodcodes' vector, use that as pattern in grepl
, Reduce
the list
of logical vector
s into a single logical vector
to subset
the rows
subset(dat, Reduce(`|`, lapply(goodcodes, function(x) grepl(x, Codes))))
# Identifier Codes
#1 1 A14 R17
#3 3 C11 K71 V91
dat <- structure(list(Identifier = 1:3, Codes = c("A14 R17", "R069 D136 B08",
"C11 K71 V91")), class = "data.frame", row.names = c(NA, -3L))
Upvotes: 0