Reputation: 123
I have a data frame with a string column "disease". I want to keep the rows that partially match "trauma" or "Trauma". I am currently doing the following using dplyr and stringr:
trauma_set <- df %>% filter(str_detect(disease, "trauma|Trauma"))
But the result also includes "Nontraumatic" and "nontraumatic". How can I keep only "trauma", "Trauma", "traumatic" or "Traumatic" without including "nontrauma" or "Nontrauma"? Also, is there a way to define the string to detect without having to specify both the uppercase and lowercase versions (as in both trauma and Trauma)?
Upvotes: 4
Views: 5743
Reputation: 394
You were very close to a correct solution; you just needed to add the "start of string" anchor ^, which restricts the match to values that begin with "trauma" or "Trauma":
trauma_set <- df %>% filter(str_detect(disease, "^trauma|^Trauma"))
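To see what the anchor does, here is a minimal sketch on a made-up vector. The values (including "Head trauma") and the names x and anchored are assumptions for illustration and do not come from the question's data.

```r
library(stringr)

# Hypothetical values, not taken from the original data frame
x <- c("Trauma", "traumatic", "Nontraumatic", "Head trauma")

# ^ anchors each alternative to the start of the string
anchored <- str_detect(x, "^trauma|^Trauma")

print(data.frame(x, anchored))
```

The "Non..." value is rejected, but so is any value where "trauma" appears later in the string, so this approach fits only if every entry of interest begins with the word.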
Upvotes: 0
Reputation: 887048
To require a word boundary, put \\b at the start of the pattern. To match regardless of case, wrap the pattern in regex() with ignore_case = TRUE:
library(dplyr)
library(stringr)
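# keep rows where "trauma" starts a word, in any case
out <- df %>%
  filter(str_detect(disease, regex("\\btrauma", ignore_case = TRUE)))

# check: count the remaining rows that start with "Non"
sum(str_detect(out$disease, regex("^Non", ignore_case = TRUE)))
#[1] 0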
# data (create this before running the filter above)
set.seed(24)
df <- data.frame(disease = sample(c("Nontraumatic", "Trauma",
        "Traumatic", "nontraumatic", "traumatic", "trauma"), 50,
        replace = TRUE), value = rnorm(50))
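As a rough sketch of how the word boundary behaves on edge cases, the snippet below runs the same pattern on a small made-up vector. The values "Head trauma" and "non-traumatic", the variable names, and the lookbehind variant are assumptions added for illustration; they are not part of the original data or answer.

```r
library(stringr)

# Hypothetical values for illustration
x <- c("Trauma", "traumatic", "Nontraumatic", "nontraumatic",
       "Head trauma", "non-traumatic")

# Word boundary plus case-insensitive matching, as in the answer
boundary <- str_detect(x, regex("\\btrauma", ignore_case = TRUE))

# The same pattern written with the inline (?i) flag
inline <- str_detect(x, "(?i)\\btrauma")

# A hyphen counts as a word boundary, so "non-traumatic" still matches above;
# a negative lookbehind is one possible way to reject that spelling too
no_hyphen <- str_detect(x, regex("(?<!non-)\\btrauma", ignore_case = TRUE))

print(data.frame(x, boundary, inline, no_hyphen))
```

Unlike the ^ anchor, \\b also finds "trauma" when it starts a later word in the string. If the data may contain the hyphenated spelling, the lookbehind version (or a separate exclusion step) may be worth considering.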
Upvotes: 4