Reputation: 2282
My data is:
Hello
Test1
Begin
* nm: 866 444 988
* nm: 08 66
# allowed * nm: 77 2
End
* nm: 0
I want to capture each number between the markers Begin and End. Each number must be preceded by * nm: or # allowed * nm:
My pattern works well in .NET (I use the capture collection), but it does not work in other engines. My question is how to add another \G anchor to capture the nested contiguous numbers one at a time (the question is really about mastering the \G anchor):
(?mxi)(?:
\G(?!\A)(?:^\#[ ]allowed[ ])?
|
^Begin\r\n
)
\*[ ]nm:[ ]
(?>(?'digit'\d+)|[ ])+ # the problem is here: it returns all the digits in one group
\r\n
In .NET, each number is returned as a separate capture value.
Thanks
Edit: I found a solution, but it is not an elegant pattern:
(?mx)(?:
\G(?!\A)
|
^Begin\r?\n
)
(?:\#[ ]allowed[ ])?
\*[ ]nm:
|
(?!^)\G[ ]*(\d+)\s*
Edit 2: Another problem with my second pattern: if I use [ ]*\r?\n instead of \s* at the end of the pattern, it fails. Why?
(?xm)(?:
\G(?!\A)
|
^Begin\r?\n
)
(?:\#[ ]allowed[ ])?
\*[ ]nm:
|
(?!^)\G[ ]*(\d+)
[ ]*\r?\n # <-- the problem is here
Upvotes: 2
Views: 151
Reputation: 89629
You can use this pattern: (Java/PCRE/Perl/.NET version)
(?xm) # switch on freespacing mode and multiline mode (*)
(?: \G(?!\A) | ^Begin\r?$ ) # two entry points: the end of the last match OR
# "Begin" that starts and ends a line
(?> \n # a newline can be followed by:
(?:
(?:\Q# allowed \E)? \Q* nm:\E # 1) the start of a line with numbers,
|
(?=End\r?$) # 2) the End line that closes a block,
|
.* # 3) or any other full line
)
)* # this group is optional (zero repetitions) to allow several consecutive
# numbers on the same line, but branch 3) can be repeated several times until
# branch 1) matches and the first number is found, or until branch 2) matches
# and closes the block.
\Q \E # a space
(\d+) \r? # the number
(*) Be careful with multiline mode in Ruby. In other languages, multiline mode changes the meaning of the ^ and $ anchors from "start of the string" and "end of the string" to "start of the line" and "end of the line". In Ruby, multiline mode instead allows the dot to match newlines (the equivalent of "singleline" or "dotall" mode in other languages), and ^ and $ match the start and end of a line by default, whatever the mode.
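To make the footnote concrete for one of the non-Ruby engines, here is a small Java sketch. The class name, helper method, and sample text are only illustrative; `Pattern.MULTILINE` and `Pattern.DOTALL` are the standard Java flags that correspond to the two behaviours described above.

```java
import java.util.regex.Pattern;

class MultilineVsDotall {
    // Returns true if the pattern, compiled with the given flags,
    // matches somewhere in the text.
    static boolean found(String regex, int flags, String text) {
        return Pattern.compile(regex, flags).matcher(text).find();
    }

    public static void main(String[] args) {
        String text = "a\nb\nc";
        // Without MULTILINE, ^ and $ only anchor at the ends of the whole string.
        System.out.println(found("^b$", 0, text));                 // false
        // With MULTILINE, ^ and $ also anchor at line boundaries.
        System.out.println(found("^b$", Pattern.MULTILINE, text)); // true
        // Without DOTALL, the dot does not match a newline.
        System.out.println(found("a.b", 0, text));                 // false
        // With DOTALL, the dot matches a newline (what Ruby calls multiline).
        System.out.println(found("a.b", Pattern.DOTALL, text));    // true
    }
}
```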
This pattern only relies on the fact that numbers are never at the start of a line.
When the regex engine takes branch 2) of the alternation, the pattern automatically fails, since (?=End\r?$) cannot be followed by \Q \E (\d+). Since the newline and the three branches are enclosed in an atomic group, the regex engine has no way to backtrack and try branch 3). In this way, contiguity is broken each time branch 2) matches.
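As a rough check of this behaviour, the sketch below runs the pattern in Java against the sample data from the question. Several things here are assumptions rather than part of the answer: the class and method names are made up, the pattern is written without the x flag (so spaces and # are literal and \Q...\E is not needed), and the input is assumed to use plain "\n" line endings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BlockNumbers {
    // Same structure as the pattern above, minus freespacing mode.
    // Assumption: the input uses "\n" line endings.
    static final Pattern NUMBERS = Pattern.compile(
          "(?m)"
        + "(?:\\G(?!\\A)|^Begin\\r?$)"   // entry points: end of last match, or a Begin line
        + "(?>\\n(?:"
        +     "(?:# allowed )?\\* nm:"    // 1) start of a line with numbers
        +     "|(?=End\\r?$)"             // 2) the End line: the match then fails
        +     "|.*"                       // 3) any other full line
        + "))*"
        + " (\\d+)\\r?");                 // a space, then the number in group 1

    // Collects group 1 of every successive match.
    static List<String> extract(String text) {
        List<String> numbers = new ArrayList<>();
        Matcher m = NUMBERS.matcher(text);
        while (m.find()) {
            numbers.add(m.group(1));
        }
        return numbers;
    }

    public static void main(String[] args) {
        String data = "Hello\nTest1\nBegin\n* nm: 866 444 988\n* nm: 08 66\n"
                    + "# allowed * nm: 77 2\nEnd\n* nm: 0\n";
        System.out.println(extract(data)); // [866, 444, 988, 08, 66, 77, 2]
    }
}
```

The trailing `* nm: 0` line is not captured: once branch 2) has matched at `End`, the \G chain is broken and only a new `Begin` line can restart it.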
Notes:
The \Q...\E feature lets you write a literal string without escaping special characters. In freespacing mode, all spaces inside \Q...\E are taken into account.
To make this pattern work with Ruby, you need to remove the m modifier, remove all \Q and \E, and escape (or enclose in a character class) all spaces, special characters, and the sharp sign, which starts a comment in freespacing mode.
Example: (?:\Q# allowed \E)? \Q* nm:\E
=> (?:\#[ ]allowed[ ])? \*[ ]nm:
Upvotes: 1
Reputation:
The number is in group 1 on each match. It won't be a capture collection, but that's why \G is there anyway. Also, due to the nature of this, the match position is simply invalidated when End is found.
Edit: note that you could put a capture group around (Begin) as a flag for the start of a new block.
# (?mi:(?!\A)\G|(?:(?:^Begin|(?!\A)\G)(?s:(?!^End).)*?(?:^(?:\#[ ]+allowed[ ]+)?\*[ ]+nm:)))[ ]+(\d+)
(?xmi:
(?! \A )
\G
|
(?:
(?:
^ Begin
|
(?! \A )
\G
)
(?s:
(?! ^ End )
.
)*?
(?:
^
(?: \# [ ]+ allowed [ ]+ )?
\* [ ]+ nm:
)
)
)
[ ]+
( \d+ ) # (1)
With extra comments:
(?xmi:
(?! \A ) # Here, matched before, give '[ ]+\d+' a first chance
\G # to match again.
|
(?: # Here, could have matched before
(?:
^ Begin # Give a new begin position first chance
| # or,
(?! \A ) # See if this matched before
\G
)
# If this is new begin or matched before, move the position up to
# the first/next delimiter 'nm:'
(?s: # Lazy, move the position along (dot-all in this cluster)
(?! ^ End )
.
)*?
(?: # Here we found the first/next delimiter
^
(?: \# [ ]+ allowed [ ]+ )?
\* [ ]+ nm:
)
)
)
[ ]+
( \d+ ) # (1)
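For readers who want to try this outside .NET, here is a minimal Java sketch that uses the single-line form of the pattern. The class and method names are hypothetical, and the input is assumed to use "\n" line endings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LazyScanNumbers {
    // Single-line form of the pattern (no x flag), split only for readability.
    static final Pattern NUMBERS = Pattern.compile(
          "(?mi:(?!\\A)\\G"                              // continue right after the last number
        + "|(?:(?:^Begin|(?!\\A)\\G)"                    // or start at Begin / resume after a match
        +     "(?s:(?!^End).)*?"                         // lazily skip ahead, never past End
        +     "(?:^(?:\\#[ ]+allowed[ ]+)?\\*[ ]+nm:)))" // up to the next 'nm:' delimiter
        + "[ ]+(\\d+)");                                 // the number in group 1

    // Collects group 1 of every successive match.
    static List<String> extract(String text) {
        List<String> numbers = new ArrayList<>();
        Matcher m = NUMBERS.matcher(text);
        while (m.find()) {
            numbers.add(m.group(1));
        }
        return numbers;
    }

    public static void main(String[] args) {
        String data = "Hello\nTest1\nBegin\n* nm: 866 444 988\n* nm: 08 66\n"
                    + "# allowed * nm: 77 2\nEnd\n* nm: 0\n";
        System.out.println(extract(data)); // [866, 444, 988, 08, 66, 77, 2]
    }
}
```

Each call to `find()` yields one number in group 1, so a plain loop over the matches plays the role of the .NET capture collection.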
Upvotes: 1