Reputation: 7313
It seems grep is "greedy" in the way it returns matches. Suppose I have the following data:
Sources <- c(
"Coal burning plant",
"General plant",
"coalescent plantation",
"Charcoal burning plant"
)
Registry <- seq(from = 1100, to = 1103, by = 1)
df <- data.frame(Registry, Sources)
If I run
grep("(?=.*[Pp]lant)(?=.*[Cc]oal)", df$Sources, perl = TRUE, value = TRUE)
it returns
"Coal burning plant"
"coalescent plantation"
"Charcoal burning plant"
However, I only want exact matches, i.e. only the entries where "coal" and "plant" occur as whole words. I don't want "coalescent", "plantation", and so on. For this data, I only want to see "Coal burning plant".
Upvotes: 6
Views: 7678
Reputation: 70750
You want to use word boundaries (\b) around your word patterns. A word boundary does not consume any characters. It asserts that on one side there is a word character, and on the other side there is not. You may also want to consider using the inline (?i) modifier for case-insensitive matching:
grep('(?i)(?=.*\\bplant\\b)(?=.*\\bcoal\\b)', df$Sources, perl = TRUE, value = TRUE)
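As a quick check, here is a minimal sketch that runs this pattern against the data from the question. The variable names `pattern`, `matches`, and `hits` are just illustrative, and the `grepl()` step is an optional extra for keeping whole data frame rows rather than only the matching strings.

```r
Sources <- c(
  "Coal burning plant",
  "General plant",
  "coalescent plantation",
  "Charcoal burning plant"
)
Registry <- seq(from = 1100, to = 1103, by = 1)
df <- data.frame(Registry, Sources)

# Both lookaheads must succeed from the start of the string.
# \\b stops "coalescent", "plantation" and "Charcoal" from counting as matches.
pattern <- "(?i)(?=.*\\bplant\\b)(?=.*\\bcoal\\b)"

# value = TRUE returns the matching strings themselves
matches <- grep(pattern, df$Sources, perl = TRUE, value = TRUE)
print(matches)

# grepl() returns a logical vector, which is convenient for subsetting rows
hits <- df[grepl(pattern, df$Sources, perl = TRUE), ]
print(hits)
```

With the four sample strings, only "Coal burning plant" (Registry 1100) should survive.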
Upvotes: 8
Reputation: 206566
If you always want the order "coal" then "plant", then this should work:
grep("\\b[Cc]oal\\b.*\\b[Pp]lant\\b", Sources, perl = TRUE, value = TRUE)
Here we add \b, which matches a word boundary. You can add the word boundaries to your original attempt as well:
grep("(?=.*\\b[Pp]lant\\b)(?=.*\\b[Cc]oal\\b)", Sources,
perl = TRUE, value = TRUE)
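To illustrate the difference between the two patterns, here is a small sketch. The vector `x` and the string "Plant fired by coal" are hypothetical and not part of the question's data. They are only there to show what happens when "plant" comes before "coal".

```r
# Hypothetical test strings; only the first one comes from the question
x <- c("Coal burning plant", "Plant fired by coal", "coalescent plantation")

# Ordered pattern: "coal" must appear before "plant"
ordered <- grepl("\\b[Cc]oal\\b.*\\b[Pp]lant\\b", x, perl = TRUE)

# Lookahead pattern: both whole words must appear, in either order
any_order <- grepl("(?=.*\\b[Pp]lant\\b)(?=.*\\b[Cc]oal\\b)", x, perl = TRUE)

print(data.frame(x, ordered, any_order))
```

The ordered pattern accepts only the first string, while the lookahead version also accepts the second. Neither matches "coalescent plantation".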
Upvotes: 2