Reputation: 57
This is using the .NET regex engine.
I am attempting to use a conditional inside of a lookbehind clause. When the expression is used outside of the lookbehind it behaves as I expect -- but when placed in the lookbehind, it has a different behavior.
Here is a simple example of the effect to try to duplicate the issue.
Matching:
good morning
with the regular expression:
(?<=(?(?=go)good|bad)\s)morning
yields no match.
When tried without the look behind:
(?(?=go)good|bad)\smorning
I get a match on "good morning"
By fiddling around, I discovered that the lookahead cursor location, when it is inside the lookbehind, is after the word "good":
(?<=(?(?=\smor)good|bad)\s)morning
This matches "morning".
My question is is this expected or some kind of bug?
Obviously this example is not real world - the problem that I was trying to solve when I stumbled on this issue is as follows: The expression uses a conditional to determine the length of the next word, then uses two different sets of rules for matching on that word. Similar to:
(?<=\s+(?(?=[^\s]{1,2}\s)[A-Z0-9]+|(?![A-Z]+\s)[0-9-A-Z/"']+))\s+matching\s+text
This matches the "matching text" only if a one or two letter word consisting of letters and numbers, or a longer word not consisting of only letters but can contain numbers, letters, slashes, dashes, quotes and apostrophes.
The following should match "matching text":
1 matching text
a matching text
It only matches on the first one, because the conditional evaluated to false (it was looking at the " matching" instead of "a") and the negative look ahead searching for a word consisting of all letters failed on the "a".
Further examples:
Must match "matching text":
123-1 matching text
9B matching text
15/16 matching text
"45" matching text
A matching text
AA matching text
A1 matching text
Must not match "matching text"
and matching text
" matching text
A- matching text
Upvotes: 0
Views: 4151
Reputation: 3166
I've changed your example a little, so that it can be used in PCRE for comparison. PCRE requires a fixed width lookbehind. So both alternatives are 4 characters wide. I've changed the first alternative to match 4 characters go..
and the second alternative to match any 4 characters ....
.
These are the strings to be matched in the examples. I'm deliberately using underscores instead of spaces for them to show up visibly in capturing groups:
good_morning
mild_morning
cold_morning
gold_good_morning
With almost everything in named capturing groups, it's easier to see what is matched by which part of the expression. First I'll not use a conditional, but an alternation in the lookbehind. The group name del
is just short for "delimiter":
(?<= # either or
( (?=(?<look>.*)) (?<either>go..) | (?<or>....) ) # $1
(?<del>.) # One width character
)
morning # The match
This matches "morning" on all 4 strings. The interesting bits are what's in the capturing groups. I used Regex Storm to get the captured groups:
Index | Position | Matched String | $1 | ${look} | ${either} | ${or} | ${del} |
---|---|---|---|---|---|---|---|
0 | 5 | morning | good | good_morning | good | empty string | _ |
1 | 19 | morning | mild | empty string | empty string | mild | _ |
2 | 33 | morning | cold | empty string | empty string | cold | _ |
3 | 52 | morning | good | good_morning | good | empty string | _ |
Since the lookahead is associated with the first alternative, it's no surprise that it's empty when the second alternative is being matched. But when it matches, it matches the entire string. Almost. If ((?<del>.) # One width character
is changed to (?<del>.*) # As mush as possible
, the last row becomes:
Index | Position | Matched String | $1 | ${look} | ${either} | ${or} | ${del} |
---|---|---|---|---|---|---|---|
3 | 52 | morning | gold | gold_good_morning | gold | empty string | _good_ |
It seems as though the lookahead cursor is offset from the main matching regex cursor.
According to Jan Goyvaerts's Lookahead and Lookbehind Zero-Length Assertions
the JGsoft engine and the .NET framework RegEx classes. These regex engines really apply the regex inside the lookbehind backwards, going through the regex inside the lookbehind and through the subject string from right to left.
and RexEgg.com's Mastering Lookahead and Lookbehind when going through the pattern (?<=(\d{1,5}))Z
on the string "123Z".
.NET captures 123 to Group 1. The .NET engine has a far more efficient way of processing variable-width lookbehinds. Instead of trying multiple fixed-width patterns starting at points further and further back in the string, .NET reverses the string as well as the pattern inside the lookbehind, then attempts to match that single pattern on the reversed string. Therefore, in 123Z, to try the lookbehind at the point before Z, it reverses the portion of string to be tested from 123 to 321. Likewise, the lookbehind
(?<=(\d{1,5}))
is flipped into the lookahead(?=(\d{1,5}))
.\d{1,5}
matches 321.
Reversing "good_morning" gives "gninrom_doog". For the regex to match "morning", the main regex cursor must be just between "_" and "m". I'm using a space to illustate between:
gninrom _doog
^
Reversing parts of the lookbehind:
(?= # lookahead
(?<del>.) # One width character
...
)
The alternation must not have been reversed, since the second alternative matches any 4 characters and if that had been changed to be the first alternative, the groups look
and either
would be empty on all matches.
( (?<either>..og) | (?<or>....) )
The look
lookahead captures the entire string including "good", so it can only have become a lookbehind after the first alternative:
( (?<either>..og) (?<=(?<look>.*)) | (?<or>....) )
That gives:
(?= # lookahead
(?<del>.) # One width character
# either or
( (?<either>..og) (?<=(?<look>.*)) | (?<or>....) )
)
Going through this regex for the reversed string will natually yeild the result seen in the results. Going through it for "gold_good_morning":
gninrom _doog_dlog
^
The del
first takes a character, which is the "_"
gninrom_ doog_dlog
^
The either
group captures "doog"
gninrom_doog _dlog
----^
The look
captures the string from the current cursor all the way back:
gninrom_doog _dlog
------------^
Once the lookahead (reversed lookbehind) is done, the string and the captures are reversed back.
Adding the Kleene star to del
, the (?<del>.*)
will greedily go to the end of the string. Then it will backtrack 4 characters and match (?<either>..og)
on the string "dlog" before capturing everything into (?<=(?<look>.*))
.
Adding the conditional instead of just the alternation, putting the lookahead into the condition:
(?<= # if then else
(?(?<=(?<if>.*)) (?<then>go..) | (?<else>....) )
(?<del>.) # One width character
)
morning # The match
Gives two matches. The condition if
is always true and the then
-part only mathes words starting with "go"
Index | Position | Matched String | ${if} | ${then} | ${else} | ${del} |
---|---|---|---|---|---|---|
0 | 19 | morning | _morning | good | empty string | _ |
1 | 52 | morning | _morning | good | empty string | _ |
Changing the (?<del>.) # One width character
to (?<del>.*) # As mush as possible
, the last row becomes:
Index | Position | Matched String | ${if} | ${then} | ${else} | ${del} |
---|---|---|---|---|---|---|
1 | 52 | morning | _good_morning | gold | empty string | good |
The interesting part is the capturing group if
. It doesn't contains the entire string. That indicates that the conditional isn't reversed by being evaluated last, but instead it is still being evaluated at the beginning of that part of the reserved expression. Though the direction of the lookaround has been reversed:
(?= (?<del>.) # One width character
# if then else
(?(?<=(?<if>.*)) (?<then>..og) | (?<else>....) )
)
Looking at how that will process "gold_good_morning", the capturing groups makes sense.
gninrom _doog_dlog
^
The del
first takes a character, which is the "_"
gninrom_ doog_dlog
^
The if
lookbehind captures everything from the cursor, which now includes the "_"
gninrom_ doog_dlog
--------^
Since the condition is true, the then
part is executed capturing "doog"
gninrom_doog _dlog
----^
Adding the Kleene star to del
, the (?<del>.*)
will again greedily go to the end of the string. But it's forced to backtrack 4 characters for either of then
or else
to make a match.
In order for PCRE to accept the alternation as fixed-width, I added a lookahead for the second alternative too:
(?<= # either or
( (?=(?<lookeither>.*)) (?<either>go..) | (?=(?<lookor>.*)) (?<or>....) ) # $1
(?<del>.) # One width character
)
morning # The match
Matched String | $1 | lookeither | either | lookor | or | del |
---|---|---|---|---|---|---|
morning | good | good_morning | good | null | null | _ |
morning | mild | null | null | mild_morning | mild | _ |
morning | cold | null | null | cold_morning | cold | _ |
morning | good | good_morning | good | null | null | _ |
At regex101
Instead of reversing the string, it counts back the 5 characters that is to be consumed inside the lookbehind and does its work from here.
Since PCRE doesn't support variable length lookbehinds, it's not possible to use the Kleene star on del
, but using (?<del>.{6}) # Fixed six width character
(to account for "_good_") gives one result. It can only give the one result as it has to backtrack a total of 10 characters. Only the string "gold_good_morning" makes that possible:
Matched String | $1 | lookeither | either | lookor | or | del |
---|---|---|---|---|---|---|
morning | good | gold_good_morning | good | null | null | good |
(?<= # if then else
(?(?=(?<if>.*)) (?<then>go..) | (?<else>....) )
(?<del>.) # One width character
)
morning # The match
gives this:
Matched String | if | then | else | del |
---|---|---|---|---|
morning | good_morning | good | empty string | _ |
morning | good_morning | good | empty string | _ |
At regex101
Since nothing is reversed, it's working as you expected it to work.
Using (?<del>.{6}) # Fixed six width character
gives:
Matched String | if | then | else | del |
---|---|---|---|---|
morning | gold_good_morning | gold | empty string | _ |
I expect there's no surprises here.
If you want it to give you the same captures as PCRE, then I suppose you could call it a bug. But I'd argue that once you know how it works, you can tailor your expressions to make use of it, and then call it a feature. I'd argue that non-fixed width lookbehinds certainly is a feature.
Knowing that the .NET regex engine reverses the direction of the lookaround, you can simple just reverse the lookaround look
, so that it's initally opposite. Sort of like backing in reverse is actually going forward :)
(?<= # if then else
(?(?<=(?<if>.*)) (?<then>go..) | (?<else>....) )
(?<del>.) # One width character
)
morning # The match
will give you
Index | Position | Matched String | ${if} | ${then} | ${else} | ${del} |
---|---|---|---|---|---|---|
0 | 19 | morning | good | good | empty string | _ |
1 | 52 | morning | gold_good | good | empty string | _ |
Using your conditional for matching "morning" in "good morning", you can use
(?<=(?(?<=(go.*))good|bad)\s)morning
Obviously (?<=(go))
won't do, since that would be trying to match "og" against "do" in "doog". But .*og
is going to match "doog".
Upvotes: 1
Reputation: 57
I think I understand the issue now. The important difference between a conditional inside of a lookbehind and a conditional not inside a lookbehind is the timing of when the conditional is executed (or where the search cursor is at that point.
The example without a lookbehind:
(?(?=go)good|bad)\smorning
good morning
The conditional is run at the beginning of the search. So the search cursor is before the 'g' in 'good'.
good morning
^
So at this point the look ahead evaluates to TRUE since it sees matches on the 'go'
In the example with the lookbehind, the cursor is at a different location.
(?<=(?(?=go)good|bad)\s)morning
The search cursor finds the first required item in the text: 'morning'
good morning
^
The actual search cursor stays in place to consume 'morning' if it the lookbehind matches. The lookbehind will use its own cursor to verify what is before 'morning' to determine if this 'morning' is a match. The lookbehind states that there is a '\s' directly before 'morning' and indeed there is. The temporary lookbehind cursor moves to the space:
good morning
^^
Now it gets to the conditional and runs the lookahead in the conditional statement. At this point the lookahead is looking for 'go' but it sees ' morning'. So the conditional fails. The expression says to try to match on 'bad' (or dab backwards from the lookbehind cursor) but it sees good (or doog backwards from the lookbehind cursor). So no match for 'morning'.
Since the conditional is run at the end of the word of interest when it is in a lookbehind (instead of at the beginning of that word outside of a lookbehind), the secret is to reverse the conditional to a lookbehind instead of a lookahead:
(?<=(?(?<=good)good|bad)\s)morning
This actually matches on morning. It look nonsensical in this example to look for a word and then match it - but it illustrates the concept. In the real world case stated in the question, the solution looks like this:
(?<=(?(?<=\s\S{1,2})[A-Z0-9]+|(?![A-Z]+\s)[0-9-A-Z/"']+))\s+matching\stext
This looks for a word before matching text. The conditional checks to see if it is a one or two character word. If so, it should consist of only letters and numbers. If not, it must not only consist of letters. If this is met, it can consist of letters, numbers, dashes, slashes, quotes and apostrophes.
Two things changed:
What follows is some more analysis of the original problem. Enjoy!
@zx81 had an example that is actually better at illustrating what is going on. This example has more cursor movement so it does help illustrate what is happening:
(?<=(?(?=go)good|bad))\w+pher
badphilosopher <-- 'philosopher' matches
goodgopher <-- 'gopher' matches
badgopher <-- no match
There is a big difference in this example because the \w+ is used. So the regex engine immediately matches all the text in each example since the phrase has no white space and ends in 'pher'.
So for 'badphilosopher':
Lookbehind is run and the conditional is immediately run looking for 'go' but finds 'ba'
badphilosopher
^
The condition failed so it tries to match bad to the left of the cursor, but we are at the beginning of the phrase, so no match.
It checks again at these two cursor points because the '\w+pher' matches each time:
badphilosopher
^
But the lookbehind sees 'b' then 'ba'
When the cursor gets to:
badphilosopher
^
The conditonal again fails to find a 'go' (it sees 'ph') so it attempts to match 'bad' to the left of the cursor and finds it! So there fore the \w+pher matches the philosopher text.
goodgopher
^
goodgopher matches in a similar way except that the conditional is successful.
badgopher
^
badgopher does not match because the conditonal is successful but 'good' is not found to the left of the cursor.
Putting a space in there really changes things up, because the /w+pher no longer matches the entire string.
(?<=(?(?=go)good|bad)\s+)\w+pher
bad philosopher <-- matches philosopher
good gopher <-- no match
bad gopher <-- matches gopher
In this case the cursor moves through the string until it can match \w+pher:
bad philosopher
^
At this point it starts to process the lookbehind -- and sees that a '\s+' is required to the left of the search cursor -- it finds this and moves the temporary lookbehind cursor over.
bad philosopher
^^
The conditional is run now and sees looking for 'go' at the temp lookbehind cursor but finds ' p'. The failure means trying to match bad to the left of the temp lookbehind cursor and indeed it finds it there.
good gopher
^^
The 'good gopher' example gets to the conditional and sees ' g' so it fails and then looks for 'bad' to the left of the cursor and doesn't find it. So this fails.
bad philosopher
^^
Similarly, 'bad philosopher' gets to the conditonal and finds ' p' and looks for 'bad' to the left of the cursor and finds it. So it matches.
When run without the lookbehind, all of these examples match. This can be rather counterintuitive - but you have to take the location of the cursors into account in the lookbehind.
Upvotes: -3
Reputation: 41838
Your question is a great one to dive into lookarounds and lookaheads. It is a rich question, but to me it seems that the main question is this:
Why does (?<=(?(?=go)good|bad)\s)morning
fail against "good morning"?
Because you say "My question is is this expected or some kind of bug?" This is the question I will address.
In a nutshell, at a certain position x in the string, you want to say:
(?=go)
then immediatelyTherefore the regex cannot match "good morning". QED.
Let's now take a look at a correct program with nearly the same expression. It successfully matches "goodgopher" and "badphilosopher". Once you understand why this works, you will understand why the other does not.
using System;
using System.Text.RegularExpressions;
class Program {
static void Main() {
// Simplest
string s1 = "badphilosopher";
string s2 = "goodgopher";
string s3 = "badgopher";
string pattern = @"(?<=(?(?=go)good|bad))\w+pher";
Console.WriteLine(Regex.IsMatch(s1, pattern) );
Console.WriteLine(Regex.IsMatch(s2, pattern));
Console.WriteLine(Regex.IsMatch(s3, pattern));
Console.WriteLine("\nPress Any Key to Exit.");
Console.ReadKey();
} // END Main
} // END Program
The output:
True
True
False
The key is to understands that a lookaround asserts: "at this exact position in the string, I am followed (or preceded) by..."
When it is evaluated, the lookaround is firmly in place in one part of the string. If you string several lookarounds in a row, that position does not jump around.
So how does the above regex work?
At the current position in the string, we assert:
If at this position in the string, if we are preceded by x
case 1: if at this position in the string what follows is "go"
then x is good
case 2: else
then x is bad
Then match any number of word characters followed by philosopher
To get absolutely clear on lookahead and lookbehind, you may want to read this page about regex lookarounds.
Sub-Question in Comment:
(?<=(?(?=go)good|bad))\s+\w+pher
matches on "bad philosopher" and "bad gopher". Why?
Here is why. In either "bad philosopher" or "bad gopher", place yourself just after the "d". This is your position x. At this position,
\s+\w+pher
, as this is exactly what is in front of you(?=go)
situation but in the else, and you must assert "What immediately precedes position x is "bad". Is that true? Yes.But... What Does the Engine Really Do?
So far, I have not been able to convince the OP about the path taken by the regex engine, so I thought I would paste a trace. Sadly, I do not have a trace tool for .NET regex, but I do have a trace tool for PCRE, an equally if not more potent regex engine.
The trace shows the path of the engine, as reported by pcretest, for the three test strings. Note one tiny difference with the regex in the .NET code supplied above: instead of good|bad
we have goo|bad
to accommodate for PCRE's lack of support for variable-length negative lookbehinds.
The trace clearly shows that PCRE evaluates the regex from left to right: first the lookbehind, then the \w+pher
I cannot say with 100% certainty that .NET proceeds in the same way, but this has certainly been the standard for regex engine. Of course there is a possibility that engines that support variable-length lookbehinds would proceed differently.
PCRE version 8.34 2013-12-15
~(?<=(?(?=go)goo|bad))\w+pher~C
badphilosopher
--->badphilosopher
+0 ^ (?<=(?(?=go)goo|bad))
+0 ^ (?<=(?(?=go)goo|bad))
+0 ^ (?<=(?(?=go)goo|bad))
+0 ^ (?<=(?(?=go)goo|bad))
+4 ^ ^(?(?=go)goo|bad)
+6 ^ ^(?=go)
+9 ^ ^g
+16 ^ ^b
+17 ^ ^a
+18 ^ ^d
+19 ^ )
+20 ^ )
+21 ^ \w+
+24 ^ ^ p
+24 ^ ^ p
+24 ^ ^ p
+24 ^ ^ p
+24 ^ ^ p
+25 ^ ^ h
+26 ^ ^ e
+27 ^ ^ r
+28 ^ ^
0: philosopher
googopher
--->googopher
+0 ^ (?<=(?(?=go)goo|bad))
+0 ^ (?<=(?(?=go)goo|bad))
+0 ^ (?<=(?(?=go)goo|bad))
+0 ^ (?<=(?(?=go)goo|bad))
+4 ^ ^(?(?=go)goo|bad)
+6 ^ ^(?=go)
+9 ^ ^g
+10 ^ ^o
+11 ^ ^)
+12 ^ ^g
+13 ^ ^o
+14 ^ ^o
+15 ^ |
+20 ^ )
+21 ^ \w+
+24 ^ ^ p
+24 ^ ^ p
+24 ^ ^ p
+24 ^ ^ p
+24 ^ ^ p
+25 ^ ^ h
+26 ^ ^ e
+27 ^ ^ r
+28 ^ ^
0: gopher
badgopher
--->badgopher
+0 ^ (?<=(?(?=go)goo|bad))
+0 ^ (?<=(?(?=go)goo|bad))
+0 ^ (?<=(?(?=go)goo|bad))
+0 ^ (?<=(?(?=go)goo|bad))
+4 ^ ^(?(?=go)goo|bad)
+6 ^ ^(?=go)
+9 ^ ^g
+16 ^ ^b
+17 ^ ^a
+18 ^ ^d
+19 ^ )
+20 ^ )
+21 ^ \w+
+24 ^ ^ p
+24 ^ ^ p
+24 ^ ^ p
+24 ^ ^ p
+24 ^ ^ p
+25 ^ ^ h
+26 ^ ^ e
+27 ^ ^ r
+28 ^ ^
0: gopher
Upvotes: 4
Reputation: 75222
I don't see any need for lookbehinds here. Just match the whole string normally and use a capturing group to extract the part that interests you:
Regex r = new Regex(
@"(?imn)^([A-Z0-9]{1,2}\s+|(?![A-Z]+\s)[0-9A-Z/""'-]{3,}\s+)(?<captured>matching\s+text)$");
resultString = r.Match(s).Groups["captured"].Value;
Upvotes: 0
Reputation: 12797
We can start with \s+matching\s+text
.
To eliminate match for and matching text
and similar we can add negative lookbehind: (?<!^\s*[A-Z]{3,})
.
To include match for others we can use positive lookbehind:
Match 2-letter patterns: [A-Z\d]{1,2}
Match longer patterns: [-A-Z\d/"']{3,}
<= important, we eliminated strings composed only of letters with negative lookbehind.
Combining together 2 patterns we get the following: (?<=^\s*([A-Z\d]{1,2}|[-A-Z\d/"']{3,}))
Full regex would be:
(?<!^\s*[A-Z]{3,})(?<=^\s*([A-Z\d]{1,2}|[-A-Z\d/"']{3,}))\s+matching\s+text
Upvotes: 0
Reputation: 89547
You can try something like this:
(?<=(?:(?=\S*\d)[\p{P}\d]+|[A-Z0-9]+)\s)matching text
Upvotes: 0