anatoly u

Reputation: 89

How to capture nested named groups when referencing outer group by name?

In a list of integers separated by commas, I need to capture (via a PCRE regex) the first occurrence of 12* (if any) and the first occurrence of 45* (if any). How do I do that? I tried the following, but it only captures within the first number of the sequence :(

(?P<number>(?P<n12>12\d)|(?P<n45>45\d)|\d+)(?:,(?P>number))*

Here's a sample string to test: 11,222,123,444,456,7. I expect to capture n12=123 and n45=456 here.

UPD

As a workaround, my own solution is to declare the delimiter optional (which it isn't), like this:

(?:,?(?P<number>(?P<n12>12\d)|(?P<n45>45\d)|\d+))*

- this works for me, but not in all cases: for example, ",1234", "123,4", "1234" and ",123,4" are all parsed identically, which I'd like to avoid if possible.
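
The ambiguity can be reproduced outside PCRE. This is a minimal sketch, assuming Python's built-in re module, which happens to accept this particular pattern unchanged even though it is not PCRE. The helper name captures is hypothetical, and the values it returns are the last captures of each group, not necessarily the first:

```python
import re

# The optional-delimiter workaround from the question, unchanged.
OPTIONAL_DELIM = re.compile(
    r"(?:,?(?P<number>(?P<n12>12\d)|(?P<n45>45\d)|\d+))*"
)

def captures(subject):
    """Return (n12, n45) if the whole subject matches, else None."""
    m = OPTIONAL_DELIM.fullmatch(subject)
    return (m.group("n12"), m.group("n45")) if m else None

# The sample string from the question.
print(captures("11,222,123,444,456,7"))

# Four different inputs that the pattern cannot tell apart.
for s in (",1234", "123,4", "1234", ",123,4"):
    print(repr(s), captures(s))
```

All four inputs in the loop yield the same pair, because the optional comma lets "1234" be read as "123" followed by "4".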

UPD2

N.B. C'mon, this is not the real task I'm faced with - it is just a simplified example. Here's another one so that you can get my idea better:

(?P<animal>(?P<cat>pussy|cat)|(?P<dog>doge|dog)|\w+)(?:,(?P>animal))*

pussy,mouse,dog,bird - has to capture: cat=pussy, dog=dog

Upvotes: 1

Views: 353

Answers (2)

anatoly u

Reputation: 89

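It looks like PCRE doesn't capture named subpatterns nested inside a named pattern that is called by reference: capture groups set during a subroutine call revert to their previous values once the call returns. So the exact answer to the question as asked is "There's no way. Sorry".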

But there's a workaround for this specific case: instead of referencing the subpattern:

(?P<animal>...)(?:,(?P>animal))*

- you may avoid referencing it:

(?:,(?P<animal>...))*

- but this would require the subject to start with a delimiter, which it doesn't.

A bad workaround for this is to mark the delimiter as optional:

(?:,?(?P<animal>...))*

- but it allows strange sequences to match.

A better solution is to mark the delimiter conditionally required: if the subpattern has already matched at least once, then the delimiter is required, otherwise it must be omitted:

(?:(?(animal),)(?P<animal>...))*

i.e.

(?:(?(animal),)(?P<animal>(?P<cat>pussy|cat)|(?P<dog>doge|dog)|\w+))*

N.B. This will only capture the last match for each subpattern (if any).
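
The conditional delimiter can be tried with Python's built-in re module as well. This is only a sketch under assumptions: re is not PCRE, and the conditional here refers to the group by number, (?(1),), as an adaptation, since re appears not to accept a conditional that names a group defined later in the pattern. The helper name parse_animals is hypothetical:

```python
import re

# Group 1 is "animal". (?(1),) demands a comma only once group 1 has
# matched, so the first item needs no delimiter and later items do.
CONDITIONAL_DELIM = re.compile(
    r"(?:(?(1),)(?P<animal>(?P<cat>pussy|cat)|(?P<dog>doge|dog)|\w+))*"
)

def parse_animals(subject):
    """Return the named captures if the whole subject matches, else None."""
    m = CONDITIONAL_DELIM.fullmatch(subject)
    return m.groupdict() if m else None

print(parse_animals("pussy,mouse,dog,bird"))
print(parse_animals(",pussy"))  # leading delimiter is rejected
```

For the sample subject this gives cat=pussy and dog=dog, while animal holds only the last item, bird.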

Upvotes: 1

The fourth bird

Reputation: 163362

Without named groups, you could capture either 12 or 45 in group 1. For the second capture group, recurse the first subpattern using (?1), and just before it use a negative lookahead with a backreference, (?!\1), to assert that what follows is not the same as what group 1 already captured.

^(?:\d+,)*?(12|45)(?:\d*(?:,\d+)*?,(?!\1)((?1)))?

Explanation

  • ^ Start of string
  • (?:\d+,)*? Match as few repetitions as possible (including none) of 1+ digits followed by ,
  • (12|45) Capture either 12 or 45 in group 1
  • (?: Non-capturing group
    • \d* Match the remaining digits of the first number, if any
    • (?:,\d+)*?, Match as few repetitions as possible of , followed by 1+ digits, then match ,
    • (?!\1) Negative lookahead, assert that what follows is not what was captured in group 1
    • ((?1)) Capture group 2, recursing the first subpattern
  • )? Close the non-capturing group and make it optional, so that a single capture group can also match
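
To experiment with this outside PCRE, here is a minimal sketch assuming Python's built-in re module. re has no subpattern recursion, so (?1) is written out as (12|45); that substitution is an adaptation, not the pattern above verbatim. The name FIRST_TWO is hypothetical:

```python
import re

# Same structure as the pattern above, with (?1) expanded to (12|45)
# because Python's re does not support subpattern recursion.
FIRST_TWO = re.compile(
    r"^(?:\d+,)*?(12|45)(?:\d*(?:,\d+)*?,(?!\1)(12|45))?"
)

m = FIRST_TWO.match("11,222,123,444,456,7")
print(m.groups())  # ('12', '45')

# Only one of the two prefixes is present: group 2 stays unset.
print(FIRST_TWO.match("1,2,450").groups())  # ('45', None)
```

Note that the groups hold only the prefixes 12 and 45, not the full numbers 123 and 456.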

If you want named capture groups for one or both values, you can use an alternation with the J flag, which allows duplicate subpattern names.

The pattern matches the first 12 followed by a 45, the first 45 followed by a 12, or only a 12, or only a 45.

^(?:(?:\d+,)*?(?P<n12>12)\d*(?:,\d+)*?,(?P<n45>45)|(?:\d+,)*?(?P<n45>45)\d*(?:,\d+)*?,(?P<n12>12)|(?:\d+,)*?(?P<n12>12)|(?:\d+,)*?(?P<n45>45))

Upvotes: 2