Allowing nested contiguous matches in regex

Question

My data is:

Hello
Test1
Begin
* nm: 866 444 988
* nm: 08 66
# allowed * nm: 77 2
End
* nm: 0

I want to capture each number between the markers Begin and End. Each number must be on a line that starts with

* nm: or # allowed * nm:

My pattern works well in .NET (I use the capture collection), but it does not work in other engines. My question is: how can I add another \G anchor to capture the nested contiguous numbers? (The question is about mastering the \G anchor.)

(?mxi:
  \G(?!\A)(?:^# allowed[ ])?
   |
  ^Begin\n
 )
 \*[ ]nm:[ ]
 (?>(?'digit'\d+)|[ ])+ # the problem is here: it returns all the digits in one group

In .NET, each number is available as a separate capture value.

Thanks

Edit: I found a solution, but it is not an elegant pattern:

(?mx:
   \G(?!\A)
      |
   ^Begin\r?\n
)
(?:#[ ]allowed[ ])?
\*[ ]nm:
  |
(?!^)\G[ ]*(\d+)\s*


Edit 2:

Another problem with my second pattern: if I use [ ]*\r?\n instead of \s* at the end of the pattern, it fails. Why?

 (?xm:
     \G(?!\A)
         |
     ^Begin\r?\n
 )
 (?:#[ ]allowed[ ])?
 \*[ ]nm:
     |
 (?!^)\G[ ]*(\d+)
 [ ]*\r?\n # <-- the problem here

Casimir et Hippolyte · Accepted Answer

You can use this pattern: (Java/PCRE/Perl/.NET version)

(?xm)  # switch on freespacing mode and multiline mode*
(?: \G(?!\A) | ^Begin\r?$ )  # two entry points: the end of the last match OR
                             # "Begin" that starts and ends a line

(?> \n
  # a newline can be followed by:
    (?:
        (?:\Q# allowed \E)? \Q* nm:\E  # 1) the start of a line with numbers,
      |
        (?=End\r?$)                    # 2) the last line of a block,
      |
        .*                             # 3) or any other full line
    )
)*  # this group is optional to allow several consecutive numbers,
    # but branch 3) can be repeated several times until branch 1)
    # matches and the first number is found, or until branch 2) matches
    # and closes the block.
\Q \E      # a space
(\d+)  \r? # the number

(*) Be careful with the multiline mode in Ruby. In other languages, the multiline mode changes the meaning of the ^ and $ anchors from "start of the string" and "end of the string" to "start of the line" and "end of the line". In Ruby, the multiline mode allows the dot to match newlines (the equivalent of the "singleline" or "dotall" mode in other languages), and ^ and $ match the start and end of a line by default, whatever the mode.
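To make the footnote concrete for a non-Ruby engine, here is a minimal sketch using Python's built-in re module. Python is chosen only for illustration and is not one of the engines the pattern above targets; the sample string is made up. In Python, re.MULTILINE changes ^ and $, while re.DOTALL changes the dot.

```python
import re

sample = "Hello\nBegin\nEnd"  # hypothetical three-line input

# Without MULTILINE, ^ and $ only anchor to the whole string.
anchored_default = re.search(r"^Begin$", sample)
# With MULTILINE, ^ and $ also match at line boundaries.
anchored_multiline = re.search(r"^Begin$", sample, re.MULTILINE)

# Without DOTALL, the dot refuses to match a newline.
dot_default = re.search(r"Hello.Begin", sample)
# With DOTALL, the dot matches the newline as well.
dot_dotall = re.search(r"Hello.Begin", sample, re.DOTALL)

print(anchored_default, anchored_multiline, dot_default, dot_dotall)
```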

The pattern relies only on the fact that numbers are never at the start of a line.

When the regex engine takes branch 2) of the alternation, the pattern automatically fails, since (?=End\r?$) cannot be followed by \Q \E(\d+). Since the newline and the three branches are enclosed in an atomic group, the regex engine has no way to backtrack and try branch 3). In this way, contiguity is broken each time branch 2) matches.
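The effect of the atomic group can be seen on a reduced case. The sketch below uses Python's built-in re module purely for illustration. Because that module only supports (?>...) from Python 3.11 on, the atomic group is emulated with the common (?=(...))\1 idiom, which is an assumption of this example and not part of the pattern above. The reduced pattern and the input are made up.

```python
import re

text = "\nEnd 7"  # made-up input: a newline, then a line starting with "End"

# Ordinary group: after (?=End) succeeds and the space fails, the engine
# backtracks into the alternation and retries with .*
plain = re.compile(r"\n(?:(?=End)|.*)[ ](\d+)")

# Emulated atomic group: the lookahead keeps the first branch that succeeds
# (an empty match here), and \1 replays it without any retry.
atomic_like = re.compile(r"\n(?=((?=End)|.*))\1[ ](\d+)")

print(plain.match(text))        # a match: the number leaks through branch .*
print(atomic_like.match(text))  # None: the chain is broken at "End"
```

With the ordinary group, the number after End is still captured; with the atomic behaviour, the match fails as soon as the End branch has been taken.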

Notes:
The \Q...\E feature lets you write a literal string without escaping special characters. In freespacing mode, all spaces inside \Q...\E are taken into account.

To make this pattern work with Ruby, you need to remove the m modifier, remove all \Q and \E, and escape or enclose in a character class all spaces, special characters, and the sharp sign (#) that freespacing mode uses to start a comment.
Example: (?:\Q# allowed \E)? \Q* nm:\E => (?:\#[ ]allowed[ ])? \*[ ]nm:
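The same kind of translation is needed for any engine without \Q...\E. As a rough sketch, here is the idea in Python's built-in re module, which also lacks \G. The anchor is therefore emulated by calling match() at the position where the previous match ended, and a negative lookahead stands in for branch 2) and the atomic group. This is an approximation under those assumptions, not an exact equivalent of the pattern above. The names ENTRY, STEP and numbers_in_block are hypothetical.

```python
import re

DATA = """Hello
Test1
Begin
* nm: 866 444 988
* nm: 08 66
# allowed * nm: 77 2
End
* nm: 0
"""

# Entry point: a line made of "Begin" only.
ENTRY = re.compile(r"^Begin\r?$", re.MULTILINE)

# One step of the chain. Literal spaces and the sharp sign are escaped or put
# in a character class because re.VERBOSE, like freespacing mode, ignores them.
STEP = re.compile(
    r"""
    (?: \n (?! End \r? $ )                    # go to the next line unless it is End
        (?: (?: \#[ ]allowed[ ] )? \*[ ]nm:   # 1) the start of a line with numbers
          | .* $                              # 3) or any other full line
        )
    )*
    [ ] (\d+) \r?                             # a space, then the number
    """,
    re.VERBOSE | re.MULTILINE,
)


def numbers_in_block(text):
    """Collect the numbers found between a Begin line and the next End line."""
    found = []
    entry = ENTRY.search(text)
    while entry:
        pos = entry.end()
        step = STEP.match(text, pos)  # emulates \G: must match exactly at pos
        while step:
            found.append(step.group(1))
            pos = step.end()
            step = STEP.match(text, pos)
        entry = ENTRY.search(text, pos)  # look for another Begin block
    return found


print(numbers_in_block(DATA))
# ['866', '444', '988', '08', '66', '77', '2']
```

The 0 on the last line is not returned, because the chain stops at End and no new Begin line follows.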
