utopia

Reputation: 490

.NET regex with conditional lookbehind and capture group

Pattern: a(?(?<! ) )b (c)

Input: a b c

Description: The conditional should match a space if the lookbehind does not find a preceding space.

It matches correctly, but the capture group $1 is empty (instead of containing c).

Is this a problem with .NET regex, or am I missing something?

Example: http://regexstorm.net/tester?p=a(%3f(%3f%3C!+)+)b+(c)&i=a+b+c
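
For anyone who prefers to reproduce this locally rather than in the online tester, here is a minimal C# sketch using the pattern and input from the question. The class name `Repro` is only illustrative, and no expected output is shown because the result may depend on the .NET runtime version in use.

```csharp
using System;
using System.Text.RegularExpressions;

class Repro
{
    static void Main()
    {
        // Pattern and input are copied verbatim from the question.
        Match m = Regex.Match("a b c", @"a(?(?<! ) )b (c)");

        Console.WriteLine("Success: " + m.Success);
        Console.WriteLine("Group count: " + m.Groups.Count);

        // The question reports that this prints an empty string instead of "c".
        Console.WriteLine("$1 = \"" + m.Groups[1].Value + "\"");
    }
}
```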

Upvotes: 6

Views: 226

Answers (2)

Dmitry Egorov

Reputation: 9650

In addition to @revo's answer:

Conditional constructs with an explicit zero-width assertion as the condition are not the only ones affected. In fact, almost every conditional construct whose condition is a parenthesized regex (grouping, conditional, or other special construct) used without extra parentheses is affected.

There are four types of (mis)behaviour in such cases:

  1. The capture group array gets mangled (as pointed out by the OP): the capture group immediately following the conditional construct is lost, and the other groups are shifted left, leaving the last capture group undefined.

    In the following examples the expected capture allocation is

    $1="a", $2="b", $3="c"
    

    while the actual result is

    $1="a", $2="c", $3="" (the last one is an empty string)
    

    Applies to:

  2. An ArgumentException is thrown at run time when the regex is parsed. This actually makes sense, since it explicitly warns of a potential regex error rather than playing funny tricks with captures as in the previous case.

    Applies to:

    • (a)(?(?<n>.) )(b) (c), (a)(?(?'n'.) )(b) (c) - named groups - exception message: "Alternation conditions do not capture and cannot be named"
    • (a)(?(?'-n' .) )(b) (c), (?<a>a)(?(?<a-n>.) )(b) (c) - balancing groups - exception message: "Alternation conditions do not capture and cannot be named"
    • (a)(?(?# comment) )(b) (c) - inline comment - exception message: "Alternation conditions cannot be comments"
  3. An OutOfMemoryException is thrown during pattern matching. I believe this is clearly a bug.

    Applies to:

    • (a)(?(?i) )(b) (c) - inline options (not to be confused with group options)
  4. [Surprisingly] works as expected, but this is a rather artificial example:

All these regexes may be fixed by enclosing the condition expression in explicit parentheses (i.e. extra ones if the expression itself already contains parentheses). Here are the fixed versions (in the order of appearance):

(a)(?((?=.)) )(b) (c)
(a)(?((?!z)) )(b) (c)
(a)(?((?<=.)) )(b) (c)
(a)(?((?<! )) )(b) (c)
(a)(?((?: )) )(b) (c)
(a)(?((?i:.)) )(b) (c)
(a)(?((?>.)) )(b) (c)
(a)(?((?(1).)) )(b) (c)
((?<n>a))(?((?(n).)) )(b)(c)
(a)(?((?(?:.).)) )(b) (c)
(a)(?((?<n>.)) )(b) (c)
(a)(?((?'n'.)) )(b) (c)
(a)(?((?'-n' .)) )(b) (c)
(?<a>a)(?((?<a-n>.)) )(b) (c)
(a)(?((?# comment)) )(b) (c)
(a)(?((?i)) )(b) (c)
(a)(?((?(.).)) )(b) (c)

Sample code to check all these expressions: https://ideone.com/KHbqMI
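
In case the link goes stale, a comparison along these lines can be sketched in a few lines of C#. This is not the linked sample; `ConditionCheck` and `Describe` are hypothetical names, and the first pattern is an assumption obtained by removing the extra parentheses from the fixed lookbehind version above. The inline-options case is left out on purpose because of the reported OutOfMemoryException.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

class ConditionCheck
{
    // Hypothetical helper: lists the numbered groups $1..$n of the first match,
    // or returns the parser's message if the pattern is rejected.
    static string Describe(string pattern, string input)
    {
        try
        {
            Match m = Regex.Match(input, pattern);
            if (!m.Success) return "no match";
            return string.Join(", ",
                m.Groups.Cast<Group>()
                        .Skip(1) // skip $0, the whole match
                        .Select((g, i) => "$" + (i + 1) + "=\"" + g.Value + "\""));
        }
        catch (ArgumentException e)
        {
            return "ArgumentException: " + e.Message;
        }
    }

    static void Main()
    {
        string[] patterns =
        {
            @"(a)(?(?<! ) )(b) (c)",    // bare lookbehind condition (assumed original form)
            @"(a)(?((?<! )) )(b) (c)",  // same condition with extra parentheses
            @"(a)(?(?<n>.) )(b) (c)",   // bare named group condition
            @"(a)(?((?<n>.)) )(b) (c)", // same condition with extra parentheses
        };

        foreach (string pattern in patterns)
        {
            string result = Describe(pattern, "a b c");
            Console.WriteLine(pattern + "  ->  " + result);
        }
    }
}
```

Printing each bare pattern next to its parenthesized counterpart makes it easy to see whether the captures shift or the parser throws on a given runtime.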

Upvotes: 2

revo

Reputation: 48761

I'm not sure whether this behavior is documented (if it is, I didn't find it), but a conditional construct with an explicit zero-width assertion as its expression, (?(?=expression)yes|no), overrides the very next numbered capturing group (empties it). You can confirm this by running the regex below:

a(?(?<! ) )b (c)()

Four ways to overcome this issue:

  1. Enclosing the expression in parentheses, as noted by @DmitryEgorov. This keeps the following capturing group intact, and the enclosing parentheses are not included in the result. This is the right way:

    a(?((?<! )) )b (c)
    
  2. As this behavior only applies to unnamed (default) capturing groups, you can get the expected result by using a named capturing group:

    a(?(?<! ) )b (?<first>c)
    
  3. Adding an extra capturing group wherever you like between the conditional and (c):

    a(?(?<! ) )(b) (c)
    
  4. Avoiding such an expression if possible, e.g.:

    a(?( ) )b (c)
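
The four workarounds can be tried side by side with a rough C# sketch like the following. The class name `Workarounds` is illustrative only, and the group contents are printed rather than asserted, since which index ends up holding c for the third pattern is exactly what is in question.

```csharp
using System;
using System.Text.RegularExpressions;

class Workarounds
{
    static void Main()
    {
        // The four patterns from the list above, in the same order.
        string[] patterns =
        {
            @"a(?((?<! )) )b (c)",
            @"a(?(?<! ) )b (?<first>c)",
            @"a(?(?<! ) )(b) (c)",
            @"a(?( ) )b (c)",
        };

        foreach (string pattern in patterns)
        {
            Match m = Regex.Match("a b c", pattern);
            Console.WriteLine(pattern + "  (success: " + m.Success + ")");

            // Print every group after $0 so any shifted capture is visible.
            for (int i = 1; i < m.Groups.Count; i++)
                Console.WriteLine("  group " + i + " = \"" + m.Groups[i].Value + "\"");
        }

        // For the named group variant, the capture can also be read by name.
        Match named = Regex.Match("a b c", patterns[1]);
        Console.WriteLine("first = \"" + named.Groups["first"].Value + "\"");
    }
}
```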
    

Upvotes: 4
