sedeh

Reputation: 7313

R grep and exact matches

It seems grep is "greedy" in the way it returns matches. Suppose I have the following data:

Sources <- c(
                "Coal burning plant",
                "General plant",
                "coalescent plantation",
                "Charcoal burning plant"
        )

Registry <- seq(from = 1100, to = 1103, by = 1)

df <- data.frame(Registry, Sources)

If I perform grep("(?=.*[Pp]lant)(?=.*[Cc]oal)", df$Sources, perl = TRUE, value = TRUE), it returns

"Coal burning plant"     
"coalescent plantation"  
"Charcoal burning plant" 

However, I only want exact matches, i.e. only entries where "coal" and "plant" occur as whole words. I don't want "coalescent", "plantation" and so on. So for this data, I only want to see "Coal burning plant".

Upvotes: 6

Views: 7678

Answers (2)

hwnd

Reputation: 70750

You want to use word boundaries \b around your word patterns. A word boundary does not consume any characters. It asserts that on one side there is a word character, and on the other side there is not. You may also want to consider using the inline (?i) modifier for case-insensitive matching.

grep('(?i)(?=.*\\bplant\\b)(?=.*\\bcoal\\b)', df$Sources, perl = TRUE, value = TRUE)
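
To see what the word boundaries change, here is a small sketch that runs the pattern with and without \b on the sample data from the question. The names loose, strict and comparison are only illustrative, and grepl is used instead of grep so that each string gets its own TRUE/FALSE.

```r
# Sample data copied from the question
Sources <- c(
  "Coal burning plant",
  "General plant",
  "coalescent plantation",
  "Charcoal burning plant"
)

# Without boundaries: "coal" and "plant" may match inside longer words
loose <- grepl("(?i)(?=.*plant)(?=.*coal)", Sources, perl = TRUE)

# With a word boundary on both sides: each word must stand on its own
strict <- grepl("(?i)(?=.*\\bplant\\b)(?=.*\\bcoal\\b)", Sources, perl = TRUE)

# Side-by-side view of both results (illustrative helper data frame)
comparison <- data.frame(Sources, loose, strict)
print(comparison)
```

Only "Coal burning plant" should remain TRUE in the strict column: "plantation" has a letter right after "plant", and "Charcoal" has a letter right before "coal", so neither position is a word boundary.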


Upvotes: 8

MrFlick

Reputation: 206566

If you always want the order "coal" then "plant", then this should work:

grep("\\b[Cc]oal\\b.*\\b[Pp]lant\\b", Sources, perl = TRUE, value = TRUE)

Here we add \b, which matches a word boundary. You can add word boundaries to your original attempt as well:

grep("(?=.*\\b[Pp]lant\\b)(?=.*\\b[Cc]oal\\b)", Sources, 
    perl = TRUE, value = TRUE)
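
The difference between the two patterns only shows up when the words appear in the other order. A minimal sketch, assuming a made-up test vector x (the string "Plant burning coal" is hypothetical and not part of the question's data):

```r
# Hypothetical test strings; "Plant burning coal" is invented for illustration
x <- c("Coal burning plant", "Plant burning coal", "Charcoal burning plant")

# Ordered pattern: "coal" must come before "plant"
ordered <- grep("\\b[Cc]oal\\b.*\\b[Pp]lant\\b", x, perl = TRUE, value = TRUE)

# Lookahead pattern: both words must be present, in either order
any_order <- grep("(?=.*\\b[Pp]lant\\b)(?=.*\\b[Cc]oal\\b)", x,
    perl = TRUE, value = TRUE)

print(ordered)
print(any_order)
```

The ordered pattern should keep only "Coal burning plant", while the lookahead version should also keep "Plant burning coal". Both reject "Charcoal burning plant" because of the boundary before "coal".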

Upvotes: 2
