dc.tv

Reputation: 123

Filter according to partial match of string variable in R

I have a data frame with a string column "disease". I want to keep the rows that partially match "trauma" or "Trauma". So far I have done the following using dplyr and stringr:

trauma_set <- df %>% filter(str_detect(disease, "trauma|Trauma"))

But the result also includes "Nontraumatic" and "nontraumatic". How can I keep only "trauma", "Trauma", "traumatic", or "Traumatic" without including "nontraumatic" or "Nontraumatic"? Also, is there a way to define the pattern without having to specify both the uppercase and lowercase versions of the string (as in both trauma and Trauma)?

Upvotes: 4

Views: 5743

Answers (2)

seapen

Reputation: 394

You were very close to a correct solution. You just need to add the start-of-string anchor ^ to each alternative, so that only values beginning with "trauma" or "Trauma" match:

trauma_set <- df %>% filter(str_detect(disease, "^trauma|^Trauma"))
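To see what the anchor does, here is a minimal sketch on a small vector. The sample values are hypothetical and not taken from the question's data; "Head trauma" is included only to show that ^ ties the match to the start of the whole string, so a value where "trauma" appears later is not matched.

```r
library(stringr)

# Hypothetical sample values, for illustration only
x <- c("Trauma", "traumatic", "Nontraumatic", "Head trauma")

# ^ anchors each alternative to the start of the string
anchored <- str_detect(x, "^trauma|^Trauma")
anchored
#[1]  TRUE  TRUE FALSE FALSE
```

If the column can contain values where "trauma" is not the first word, a word-boundary pattern may be a better fit than ^.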

Upvotes: 0

akrun

Reputation: 887048

To match "trauma" only where it begins a word, put the word boundary \\b at the start of the pattern. To match regardless of case, wrap the pattern in regex() and set ignore_case = TRUE.

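library(dplyr)
library(stringr)

# Keep rows where "trauma" starts a word, in any letter case
out <- df %>%
  filter(str_detect(disease, regex("\\btrauma", ignore_case = TRUE)))

# Check: no remaining value starts with "Non"
sum(str_detect(out$disease, regex("^Non", ignore_case = TRUE)))
#[1] 0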
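As a possible alternative, the same match can be sketched with the inline (?i) flag, or in base R with grepl(). The vector below is hypothetical and only for illustration. One caveat: a hyphen is a non-word character, so a value such as "Non-traumatic" would also match \\btrauma.

```r
library(stringr)

# Hypothetical sample values, for illustration only
x <- c("Trauma", "traumatic", "Nontraumatic", "nontraumatic", "Head trauma")

# Inline (?i) flag: case-insensitive matching without wrapping in regex()
inline_flag <- str_detect(x, "(?i)\\btrauma")
inline_flag
#[1]  TRUE  TRUE FALSE FALSE  TRUE

# Base R equivalent; perl = TRUE selects the PCRE engine
base_r <- grepl("\\btrauma", x, ignore.case = TRUE, perl = TRUE)
identical(inline_flag, base_r)
#[1] TRUE
```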

data

set.seed(24)
df <- data.frame(
  disease = sample(c("Nontraumatic", "Trauma", "Traumatic",
                     "nontraumatic", "traumatic", "trauma"),
                   50, replace = TRUE),
  value = rnorm(50)
)

Upvotes: 4
