R grep and exact matches

Question

It seems grep is "greedy" in the way it returns matches. Suppose I have the following data:

Sources <- c(
  "Coal burning plant",
  "General plant",
  "coalescent plantation",
  "Charcoal burning plant"
)

Registry <- seq(from = 1100, to = 1103, by = 1)

df <- data.frame(Registry, Sources)

If I perform grep("(?=.*[Pp]lant)(?=.*[Cc]oal)", df$Sources, perl = TRUE, value = TRUE), it returns

"Coal burning plant"     
"coalescent plantation"  
"Charcoal burning plant"

However, I only want exact matches, i.e. entries where "coal" and "plant" occur as whole words. I don't want "coalescent", "plantation", and so on. For this data, I only want to see "Coal burning plant".

hwnd · Accepted Answer

You want to use word boundaries \b around your word patterns. A word boundary does not consume any characters; it asserts that on one side there is a word character and on the other side there is not. Inside an R string literal the backslash itself must be escaped, so it is written as \\b. You may also want to consider using the inline (?i) modifier for case-insensitive matching.

grep('(?i)(?=.*\\bplant\\b)(?=.*\\bcoal\\b)', df$Sources, perl = TRUE, value = TRUE)
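To check the pattern against the data from the question, here is a minimal self-contained sketch. The variable names pattern, matched and hits are only illustrative and not part of the original post. The grepl() step is an optional extra for when you need the matching rows of the data frame rather than just the strings.

```r
# Sample data from the question
Sources <- c(
  "Coal burning plant",
  "General plant",
  "coalescent plantation",
  "Charcoal burning plant"
)
Registry <- seq(from = 1100, to = 1103, by = 1)
df <- data.frame(Registry, Sources)

# Two lookaheads, so the words may appear in either order.
# \\b in the R string becomes \b in the regex (a word boundary).
pattern <- "(?i)(?=.*\\bplant\\b)(?=.*\\bcoal\\b)"

# value = TRUE returns the matching strings instead of their indices
matched <- grep(pattern, df$Sources, perl = TRUE, value = TRUE)
print(matched)

# grepl() returns a logical vector, which can index the rows directly
hits <- df[grepl(pattern, df$Sources, perl = TRUE), ]
print(hits)
```

Only "Coal burning plant" should come back: "coalescent" and "plantation" fail the trailing boundary, and "Charcoal" fails the leading one because the "coal" in it is preceded by a word character.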

