David Callanan

Reputation: 5968

regex how to match a capture group more than once

I have the following regex:

\{(\w+)(?:\{(\w+))+\}+\}

I need it to match any of the following

{a{b}}

{a{b{c}}}

{a{b{c{d...}}}}

But when I use the regex, for example on the last one, it only matches two groups: a and c. It doesn't match b or any other words that might be in between.
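
The behaviour described above can be reproduced with a minimal sketch. Ruby is only an assumption here (the question does not name a regex flavour), and the variable names are made up for illustration. A repeated capturing group keeps only the text of its last iteration, so the words in between are lost:

```ruby
# The pattern from the question, unchanged.
pattern = /\{(\w+)(?:\{(\w+))+\}+\}/

m = pattern.match("{a{b{c}}}")
first_word = m[1]   # the word after the first opening brace
last_word  = m[2]   # only the final iteration of the repeated group survives

puts "group 1: #{first_word}, group 2: #{last_word}"
# prints "group 1: a, group 2: c" (b is matched but not kept)
```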

How do I get the group to match each single one like:

group #1: a
group #2: b
group #3: c
group #4: d
group #5: etc...

or like

group #1: a
group #2: [b, c, d, etc...]

Also, how do I make it match only when there are as many { on the left as there are } on the right, and not match otherwise?

Thanks for the help,

David

Upvotes: 4

Views: 5279

Answers (2)

Dmitry Egorov

Reputation: 9650

For regex flavours supporting recursion (PCRE, Ruby) you may employ the following generic pattern:

^({\w+(?1)?})$

It lets you check whether the input matches the defined pattern, but it does not capture the desired groups. See the Matching Balanced Constructs section in http://www.regular-expressions.info/recurse.html for details.
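
As a rough sketch of the check-only pattern in action, here it is in Ruby, where the recursion (?1) is spelled \g<1>. The braces are escaped for safety, and the helper name balanced? is hypothetical, not part of the pattern above:

```ruby
# Ruby spelling of the validation pattern: (?1) becomes \g<1>.
BALANCED = /^(\{\w+\g<1>?\})$/

# Hypothetical helper: true when the whole input is one balanced nest.
def balanced?(input)
  BALANCED.match?(input)
end

["{a{b}}", "{a{b{c}}}", "{a{b}", "{a{b}}}"].each do |s|
  puts "#{s} -> #{balanced?(s)}"
end
```

The first two inputs are accepted; the last two (a missing and an extra closing brace) are rejected. Note that ^ and $ are line anchors in Ruby, so \A and \z may be preferable for multi-line input.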

To capture the groups, we can convert the pattern-checking regex into a positive lookahead that is checked only once, at the start of the string ((?:^(?=({\w+(?1)?})$)|\G(?!\A))), and then capture all the "words" using a global search:

(?:^(?=({\w+(?1)?})$)|\G(?!\A)){(\w+)

The a, b, c, etc. are now in the second capture group of each match.

Regex demo: https://regex101.com/r/2wsR10/2. PHP demo: https://ideone.com/UKTfcm.

Explanation:

  • (?: - start of alternation group
    • [first alternative]:
      • ^ - start of string
      • (?= - start of positive lookahead
      • ({\w+(?1)?}) - the generic pattern from above
      • $ - end of string
      • ) - end of positive lookahead
    • | - or
    • [second alternative]:
      • \G - end of previous match
      • (?!\A) - ensure the previous \G does not match the start of the input if the first alternative failed
  • ) - end of alternation group
  • { - opening brace literally
  • (\w+) - a "word" captured in the second group.

Ruby has different syntax for recursion and the regex would be:

(?:^(?=({\w+\g<1>?})$)|\G(?!\A)){(\w+)

Demo: http://rubular.com/r/jOJRhwJvR4
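
A minimal sketch of using the Ruby pattern with String#scan, which plays the role of the global search. The braces are escaped here, and the constant and method names are hypothetical:

```ruby
# The Ruby pattern from above, with the literal braces escaped.
WORDS = /(?:^(?=(\{\w+\g<1>?\})$)|\G(?!\A))\{(\w+)/

# Hypothetical helper: returns the nested words, or [] when the
# braces are unbalanced (the lookahead then fails on the first match).
def nested_words(input)
  # scan yields [group1, group2] per match; the word is group 2.
  input.scan(WORDS).map(&:last)
end

p nested_words("{a{b{c{d}}}}")  # balanced input
p nested_words("{a{b{c}}")      # one closing brace short
```

For the balanced input this gives the four words a, b, c, d; for the unbalanced one the list is empty, because \G(?!\A) cannot start a chain of matches on its own.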

Upvotes: 3

Wiktor Stribiżew

Reputation: 626826

In .NET, a regex can 1) check balanced groups and 2) store a capture collection for each capturing group in a group stack.

With the following regex, you may extract all the texts inside each {...} only if the whole string, starting with { and ending with }, contains a balanced number of open and close curly braces:

^{(?:(?<c>[^{}]+)|(?<o>{)|(?<-o>}))*(?(o)(?!))}$

See the regex demo.

Details:

  • ^ - start of string
  • { - an open brace
  • (?: - start of a group of alternatives:
    • (?<c>[^{}]+) - 1+ chars other than { and } captured into "c" group
    • | - or
    • (?<o>{) - { is matched and a value is pushed to the Group "o" stack
    • | - or
    • (?<-o>}) - a } is matched and a value is popped from Group "o" stack
  • )* - end of the alternation group, repeated 0+ times
  • (?(o)(?!)) - a conditional construct checking if Group "o" stack is empty; if it is not, the empty negative lookahead (?!) fails the match
  • } - a close }
  • $ - end of string.
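
To make the push/pop bookkeeping above concrete, here is a plain-Ruby emulation of it. This is only a sketch of the same logic with an explicit counter, not the .NET balancing-group feature itself, and the method name balanced_words is hypothetical:

```ruby
# Hypothetical helper: mirrors the pattern's bookkeeping with a counter.
def balanced_words(input)
  # The pattern requires a leading { and a trailing }.
  return [] unless input.start_with?("{") && input.end_with?("}")

  depth = 0   # stands in for the Group "o" stack
  words = []  # stands in for the captures of Group "c"
  input[1...-1].scan(/[^{}]+|[{}]/) do |token|
    case token
    when "{" then depth += 1          # push
    when "}"
      return [] if depth.zero?        # nothing to pop: no match
      depth -= 1                      # pop
    else words << token               # text between braces
    end
  end
  depth.zero? ? words : []            # same role as (?(o)(?!))
end

p balanced_words("{a{bb{ccc{dd}}}}")
p balanced_words("{{a{bb{ccc{dd}}}}")
```

Unlike the regex, this sketch returns an empty list both for a failed match and for a match without words (such as {}), so it cannot tell the two apart.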

C# demo:

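using System.Linq;
using System.Text.RegularExpressions;

// Flatten every capture of group "c" from the (single) match into a list of strings.
var pattern = "^{(?:(?<c>[^{}]+)|(?<o>{)|(?<-o>}))*(?(o)(?!))}$";
var result = Regex.Matches("{a{bb{ccc{dd}}}}", pattern)
          .Cast<Match>().SelectMany(m => m.Groups["c"].Captures.Cast<Capture>())
          .Select(c => c.Value).ToList();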

Output for {a{bb{ccc{dd}}}} is [a, bb, ccc, dd] while for {{a{bb{ccc{dd}}}} (a { is added at the beginning), results are empty.

Upvotes: 3
