Reputation: 490
Pattern: a(?(?<! ) )b (c)
Input: a b c
Description: The condition should match a space if the lookbehind is not a space.
It matches correctly, but capture group $1 is empty (instead of containing c).
Is this a problem with .NET regex, or am I missing something?
Example: http://regexstorm.net/tester?p=a(%3f(%3f%3C!+)+)b+(c)&i=a+b+c
Upvotes: 6
Views: 226
Reputation: 9650
In addition to @revo's answer:
Conditional constructs with an explicit zero-width assertion as their expression are not the only ones affected. In fact, almost every conditional construct whose condition expression is a parenthesized regex (grouping, conditional, or other special construct) used without extra parentheses is affected.
There are four types of (mis)behaviour in such cases:
The capture group array gets mangled (as pointed out by the OP): the capture group immediately following the conditional construct is lost, and the other groups are shifted left, leaving the last capture group undefined.
In the following examples the expected capture allocation is
$1="a", $2="b", $3="c"
while the actual result is
$1="a", $2="c", $3="" (the last one is an empty string)
Applies to:
(a)(?(?=.) )(b) (c)
- positive lookahead
(a)(?(?!z) )(b) (c)
- negative lookahead
(a)(?(?<=.) )(b) (c)
- positive lookbehind
(a)(?(?<! ) )(b) (c)
- negative lookbehind
(a)(?(?: ) )(b) (c)
- noncapturing group
(a)(?(?i:.) )(b) (c)
- group options
(a)(?(?>.) )(b) (c)
- nonbacktracking subexpression
(a)(?(?(1).) )(b) (c)
- nested condition on a capture group by number
((?<n>a))(?(?(n).) )(b)(c)
- nested condition on a capture group by name
(a)(?(?(?:.).) )(b) (c)
- nested condition with implicitly parenthesized regex
Throws ArgumentException at run time when the regex is parsed. This actually makes sense, since it explicitly warns us of a potential regex error rather than playing funny tricks with captures as in the previous case.
Applies to:
(a)(?(?<n>.) )(b) (c), (a)(?(?'n'.) )(b) (c)
- named groups - exception message: "Alternation conditions do not capture and cannot be named"
(a)(?(?'-n' .) )(b) (c), (?<a>a)(?(?<a-n>.) )(b) (c)
- balancing groups - exception message: "Alternation conditions do not capture and cannot be named"
(a)(?(?# comment) )(b) (c)
- inline comment - exception message: "Alternation conditions cannot be comments"
Throws OutOfMemoryException during pattern matching. I believe this is clearly a bug.
Applies to:
(a)(?(?i) )(b) (c)
- inline options (not to be confused with group options)
[Surprisingly] works as expected, though this is a rather artificial example:
(a)(?(?(.).) )(b) (c)
- nested condition with explicitly parenthesized regex
All these regexes can be fixed by enclosing the condition expression in explicit parentheses (i.e. extra ones if the expression itself already contains parentheses). Here are the fixed versions (in order of appearance):
(a)(?((?=.)) )(b) (c)
(a)(?((?!z)) )(b) (c)
(a)(?((?<=.)) )(b) (c)
(a)(?((?<! )) )(b) (c)
(a)(?((?: )) )(b) (c)
(a)(?((?i:.)) )(b) (c)
(a)(?((?>.)) )(b) (c)
(a)(?((?(1).)) )(b) (c)
((?<n>a))(?((?(n).)) )(b)(c)
(a)(?((?(?:.).)) )(b) (c)
(a)(?((?<n>.)) )(b) (c)
(a)(?((?'n'.)) )(b) (c)
(a)(?((?'-n' .)) )(b) (c)
(?<a>a)(?((?<a-n>.)) )(b) (c)
(a)(?((?# comment)) )(b) (c)
(a)(?((?i)) )(b) (c)
(a)(?((?(.).)) )(b) (c)
Sample code to check all these expressions: https://ideone.com/KHbqMI
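Since the fix is purely mechanical, it can be sketched as a small pattern-rewriting helper. The following is only a minimal sketch in Python, used here just for string manipulation (Python's own re module does not accept lookaround conditions, so the rewritten patterns are still meant for .NET). The function name wrap_condition is hypothetical, and the sketch assumes the pattern contains no escaped parentheses and no parentheses inside character classes.

```python
def wrap_condition(pattern: str) -> str:
    """Wrap the condition of each outermost '(?(?...' conditional in extra parentheses.

    Assumption (not from the original post): the pattern has no escaped
    parentheses and no parentheses inside character classes, so simple
    depth counting is enough to find the end of the condition.
    """
    out = []
    i = 0
    n = len(pattern)
    while i < n:
        # "(?(?" marks a conditional whose condition is itself a special group.
        if pattern.startswith("(?(?", i):
            cond_start = i + 2  # index of the condition's opening "("
            depth = 0
            j = cond_start
            while j < n:
                if pattern[j] == "(":
                    depth += 1
                elif pattern[j] == ")":
                    depth -= 1
                    if depth == 0:
                        break
                j += 1
            if j == n:
                raise ValueError("unbalanced parentheses in pattern")
            condition = pattern[cond_start:j + 1]
            # Emit "(?(" + original condition + ")" and skip past the condition,
            # so anything nested inside it is copied verbatim.
            out.append("(?(" + condition + ")")
            i = j + 1
        else:
            out.append(pattern[i])
            i += 1
    return "".join(out)


if __name__ == "__main__":
    for p in ["(a)(?(?=.) )(b) (c)", "(a)(?(?(1).) )(b) (c)", "a(?(?<! ) )b (c)"]:
        print(p, "->", wrap_condition(p))
```

Only the outermost condition is wrapped; nested conditionals inside it are left untouched, which matches the fixed versions listed above.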
Upvotes: 2
Reputation: 48761
I'm not sure whether this behavior is documented (if it is, I couldn't find it), but using a conditional construct with an explicit zero-width assertion as its expression, (?(?=expression)yes|no), overrides the very next numbered capturing group (empties it). You can confirm this by running the regex below:
a(?(?<! ) )b (c)()
Four ways to overcome this issue:
Enclosing the expression in parentheses, as noted by @DmitryEgorov. This also keeps the second capturing group intact, and the extra parentheses are not included in the result. This is the right way:
a(?((?<! )) )b (c)
Since this behavior only affects unnamed (default) capturing groups, you can get the expected result by using a named capturing group:
a(?(?<! ) )b (?<first>c)
Adding an extra capturing group wherever you like between the conditional and (c):
a(?(?<! ) )(b) (c)
Avoiding such an expression where possible, e.g.:
a(?( ) )b (c)
Upvotes: 4