Reputation: 21400
I have this set of sentences:
w <- c("so i said er well it would n't surprise me if it could bloody talk", # quote marker
"we got fifteen, well thirteen minutes",
"well she brought a pie and she brought some er punch round",
"so your dad said well have n't i been soft ?", # quote marker
"And he went [pause] well I can't feel any. ", # quote marker
"I goes well they'll improve the grant to start off with", # quote marker
"so with the chips as well this is about one sixty .",
"well we 're not all the same are we , but")
All strings contain the word 'well'. I'm interested in those strings where 'well' acts as a quote marker, as indicated by the occurrence of 'said', 'goes', and 'went'. Using positive lookbehind I can match these sentences:
grep("(?<=said|goes|went).*well", w, value = TRUE, perl = TRUE)
[1] "so i said er well it would n't surprise me if it could bloody talk"
[2] "so your dad said well have n't i been soft ?"
[3] "And he went [pause] well I can't feel any. "
[4] "I goes well they'll improve the grant to start off with"
The issue I have is that a negative lookbehind, meant to match those strings in which 'well' is not a quote marker, does not work. For example, this matches everything:
grep("(?<!said|goes|went).*well", w, value = TRUE, perl = TRUE)
[1] "so i said er well it would n't surprise me if it could bloody talk" # not match
[2] "we got fifteen, well thirteen minutes" # match
[3] "well she brought a pie and she brought some er punch round" # match
[4] "so your dad said well have n't i been soft ?" # not match
[5] "And he went [pause] well I can't feel any. " # not match
[6] "I goes well they'll improve the grant to start off with" # not match
[7] "so with the chips as well this is about one sixty ." # match
[8] "well we 're not all the same are we , but" # match
Why doesn't it match correctly and how would it have to be changed to match correctly?
Thanks in advance!
Upvotes: 1
Views: 41
Reputation: 626758
That happens because (?<!said|goes|went) matches any location in the string that is not immediately preceded by one of the strings defined in the lookbehind. .* then matches any 0+ chars other than line break chars, as many as possible, and then well is matched. There are a lot of such valid positions: the very start of the string is always one of them, so every string that contains well matches.
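To make this visible, here is a minimal sketch that asks regexpr() where the first match starts in each string. It reuses the vector w from the question; the name starts is only illustrative.

```r
# Same vector as in the question
w <- c("so i said er well it would n't surprise me if it could bloody talk",
       "we got fifteen, well thirteen minutes",
       "well she brought a pie and she brought some er punch round",
       "so your dad said well have n't i been soft ?",
       "And he went [pause] well I can't feel any. ",
       "I goes well they'll improve the grant to start off with",
       "so with the chips as well this is about one sixty .",
       "well we 're not all the same are we , but")

# 1-based start position of the first match in each string
starts <- as.integer(regexpr("(?<!said|goes|went).*well", w, perl = TRUE))
print(starts)
# Every value is 1: the lookbehind is checked at the start of the string,
# where nothing precedes it, so it succeeds and .* runs forward to 'well'.
```

The lookbehind only constrains the position where the match begins, not the text that .* goes on to consume, which is why it cannot exclude 'said', 'goes' or 'went' further along in the string.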
The easiest is to match those strings where 'said', 'goes' or 'went' occur before 'well' and skip them, then match 'well' in all other contexts:
\b(?:said|goes|went)\b.*\bwell\b(*SKIP)(*F)|\bwell\b
See the regex demo.
Caution: If you use a solution like ^(?!.*\b(?:said|goes|went)\b).*\bwell\b, you may get false negatives when 'said', 'goes' or 'went' appear after 'well'.
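A small sketch of that false negative, comparing the lookahead pattern with the (*SKIP)(*F) pattern. The first test sentence is made up for illustration and is not from the question's data; the variable names are likewise only illustrative.

```r
# x[1] is a hypothetical sentence where 'said' occurs only AFTER 'well';
# x[2] is taken from the question ('said' occurs before 'well').
x <- c("well that is what she said",
       "so your dad said well have n't i been soft ?")

lookahead <- "^(?!.*\\b(?:said|goes|went)\\b).*\\bwell\\b"
skip_fail <- "\\b(?:said|goes|went)\\b.*\\bwell\\b(*SKIP)(*F)|\\bwell\\b"

res_lookahead <- grepl(lookahead, x, perl = TRUE)
res_skip_fail <- grepl(skip_fail, x, perl = TRUE)

print(res_lookahead)  # FALSE FALSE: x[1] is rejected because 'said' occurs anywhere
print(res_skip_fail)  # TRUE FALSE: x[1] is kept, x[2] is still excluded
```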
Pattern details
\b(?:said|goes|went)\b.*\bwell\b(*SKIP)(*F) - a whole word 'said', 'goes' or 'went', then any 0 or more chars as many as possible, and then a whole word 'well'; after this match is found, it is dropped and the regex engine starts looking for a match at the current failed position
| - or
\bwell\b - a whole word 'well'.
See an R demo:
grep("\\b(?:said|goes|went)\\b.*\\bwell\\b(*SKIP)(*F)|\\bwell\\b", w, value = TRUE, perl = TRUE)
# [1] "we got fifteen, well thirteen minutes"
# [2] "well she brought a pie and she brought some er punch round"
# [3] "so with the chips as well this is about one sixty ."
# [4] "well we 're not all the same are we , but"
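To see which occurrences of 'well' the pattern actually keeps, gregexpr() can list every match position within a single string. This is only a sketch: the test sentence is invented for illustration and the names pat, x and m are not from the answer.

```r
pat <- "\\b(?:said|goes|went)\\b.*\\bwell\\b(*SKIP)(*F)|\\bwell\\b"

# Hypothetical sentence with 'well' both before and after 'said'
x <- "well he said well and then well again"

# All match start positions (1-based) in x
m <- gregexpr(pat, x, perl = TRUE)[[1]]
print(as.integer(m))
# Only position 1 is reported: the leading 'well' is matched by the second
# alternative, while everything from 'said' up to the last 'well' is consumed
# by the first alternative and then discarded by (*SKIP)(*F).
```

Note that this string would therefore still be returned by grep(), because a 'well' that precedes the quote verb counts as a match.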
Upvotes: 2