Chris Ruehlemann

Reputation: 21400

Issue with negative lookbehind in R

I have this set of sentences:

w <- c("so i said er well it would n't surprise me if it could bloody talk",  # quote marker
        "we got fifteen, well thirteen minutes",                              
        "well she brought a pie and she brought some er punch round",         
        "so your dad said well have n't i been soft ?",                       # quote marker
        "And he went [pause] well I can't feel any. ",                        # quote marker
        "I goes well they'll improve the grant to start off with",            # quote marker
        "so with the chips as well this is about one sixty .",                
        "well we 're not all the same are we , but") 

All strings contain the word well. I'm interested in those strings where well acts as a quote marker, as indicated by the occurrence of said, goes, and went. Using positive lookbehind I can match these sentences:

grep("(?<=said|goes|went).*well", w, value = T, perl = T)
[1] "so i said er well it would n't surprise me if it could bloody talk"
[2] "so your dad said well have n't i been soft ?"                      
[3] "And he went [pause] well I can't feel any. "                       
[4] "I goes well they'll improve the grant to start off with"

The issue I have is that a negative lookbehind, meant to match those strings in which 'well' is not a quote marker, does not work. For example, this matches everything:

grep("(?<!said|goes|went).*well", w, value = T, perl = T)
[1] "so i said er well it would n't surprise me if it could bloody talk" # not match
[2] "we got fifteen, well thirteen minutes"                              # match
[3] "well she brought a pie and she brought some er punch round"         # match    
[4] "so your dad said well have n't i been soft ?"                       # not match         
[5] "And he went [pause] well I can't feel any. "                        # not match             
[6] "I goes well they'll improve the grant to start off with"            # not match         
[7] "so with the chips as well this is about one sixty ."                # match      
[8] "well we 're not all the same are we , but"                          # match

Why doesn't it match correctly and how would it have to be changed to match correctly?

Thanks in advance!

Upvotes: 1

Views: 41

Answers (1)

Wiktor Stribiżew

Reputation: 626758

That happens because (?<!said|goes|went) succeeds at any position in the string that is not immediately preceded by said, goes or went. .* then matches any zero or more characters other than line break characters, as many as possible, and then well is matched. There are many such valid positions, the very start of every string among them, so every string that contains well matches.
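To see this in action, here is a minimal sketch using two of the sentences from the question (the names w2, m, starts and matched are only illustrative). It asks regexpr() where the first match begins; the expectation is that both matches begin at the very first character, where the lookbehind has nothing to look back at and therefore succeeds.

```r
# Two sentences from the question: one with a quote marker, one without.
w2 <- c("so your dad said well have n't i been soft ?",
        "we got fifteen, well thirteen minutes")

# regexpr() returns the 1-based start of the first match in each string.
m <- regexpr("(?<!said|goes|went).*well", w2, perl = TRUE)
starts <- as.integer(m)
print(starts)

# The matched text: everything from the start of the string up to "well".
matched <- regmatches(w2, m)
print(matched)
```

Since the match is anchored nowhere near said, the lookbehind never gets a chance to reject the quote-marker sentence.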

The easiest fix is to match those strings where said, goes or went occurs before well and skip them, then match well in all other contexts:

\b(?:said|goes|went)\b.*\bwell\b(*SKIP)(*F)|\bwell\b

See the regex demo.

Caution: If you use a solution like ^(?!.*\b(?:said|goes|went)\b).*\bwell\b, you may get false negatives when said, goes or went appear after well.
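The caution above can be checked with a small sketch. The test sentence is hypothetical (it is not in the question's data), and the variable names are only for illustration: well is not used as a quote marker here, but said occurs later in the string.

```r
# Hypothetical sentence: "well" is not a quote marker, "said" comes after it.
x <- "well that is what he said"

# Lookahead-based pattern: rejects any string containing said/goes/went anywhere.
lookahead_pat <- "^(?!.*\\b(?:said|goes|went)\\b).*\\bwell\\b"
# SKIP-FAIL pattern: only rejects "well" when said/goes/went precedes it.
skip_pat <- "\\b(?:said|goes|went)\\b.*\\bwell\\b(*SKIP)(*F)|\\bwell\\b"

lookahead_hit <- grepl(lookahead_pat, x, perl = TRUE)
skip_hit <- grepl(skip_pat, x, perl = TRUE)
print(c(lookahead = lookahead_hit, skip_fail = skip_hit))
```

The lookahead version should miss this sentence, while the SKIP-FAIL version should keep it.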

Pattern details

  • \b(?:said|goes|went)\b.*\bwell\b(*SKIP)(*F) - a whole word said, goes or went, then any 0 or more chars other than line break chars, as many as possible, and then a whole word well. Once this match is found, (*SKIP)(*F) discards it and the regex engine resumes searching from the position where the failure occurred, i.e. right after that well
  • | - or
  • \bwell\b - a whole word well.
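As a rough illustration of the three parts working together, the sketch below uses two made-up strings (not from the question) and gregexpr() to list the start positions of all matches. The names s, pat and pos are hypothetical. In the first string, well occurs both before and after said; in the second, both occurrences follow said.

```r
# Made-up strings, used only to show which "well" occurrences survive.
s <- c("well he said well ok",
       "he said well it works as well")

pat <- "\\b(?:said|goes|went)\\b.*\\bwell\\b(*SKIP)(*F)|\\bwell\\b"

# gregexpr() returns all match starts per string (-1 when there is no match).
pos <- lapply(gregexpr(pat, s, perl = TRUE), as.integer)
print(pos)
```

Only the well that precedes said in the first string should be reported. Because .* is greedy, the skipped stretch runs up to the last well after said, so every later well in the same string is dropped as well.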

See an R demo:

grep("\\b(?:said|goes|went)\\b.*\\bwell\\b(*SKIP)(*F)|\\bwell\\b", w, value = TRUE, perl = TRUE)
# [1] "we got fifteen, well thirteen minutes"                     
# [2] "well she brought a pie and she brought some er punch round"
# [3] "so with the chips as well this is about one sixty ."       
# [4] "well we 're not all the same are we , but"    

Upvotes: 2
