Mike Spike

Reputation: 499

Regex capture repeating groups

I have an input that looks like this:

<ID>0<VAL>a1b<ID>1<VAL>a2b<ID>2<VAL>a3b<ID>3<VAL>a4b

I need to capture key-value pairs (i.e. id and val), or at least a flat array of groups like the following: [0, a1b, 1, a2b, 2, a3b, 3, a4b]

Capturing just one pair (i.e. when the input contains only a single pair) works with this:

(?>(?:<ID>(\d+))(?:<VAL>(.+)))?

the result being: [0, a1b].

But it doesn't work for multiple pairs: it captures 0 as the first group, and then the second group takes the rest of the input after the first <VAL> tag, as in: [0, a1b<ID>1<VAL>a2b<ID>2<VAL>a3b<ID>3<VAL>a4b]
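The behaviour described above can be reproduced with a short sketch. It assumes Python's re module and a simplified form of the pattern (the atomic group and the trailing "?" are left out, since they make no difference for this input); the variable names are only illustrative.

```python
import re

text = "<ID>0<VAL>a1b<ID>1<VAL>a2b<ID>2<VAL>a3b<ID>3<VAL>a4b"

# Simplified form of the pattern from the question (assumption: the atomic
# wrapper and the trailing "?" are dropped for brevity).
greedy = re.compile(r"<ID>(\d+)<VAL>(.+)")

match = greedy.match(text)
groups = list(match.groups())

# The greedy ".+" runs to the end of the input, so the second group
# contains every remaining pair, tags included.
print(groups)
```

The second group grows this way because .+ is greedy and nothing in the pattern tells it to stop at the next <ID> tag.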

Can someone point me in the right direction?

UPDATE: what if <ID> and <VAL> are some special chars, for example 0x8F and 0x9F?

Upvotes: 1

Views: 86

Answers (2)

Mike Spike

Reputation: 499

@bobble-bubble's solution is the most efficient (according to regex101): 4 matches in 72 steps and 1ms, but it's very restrictive. To fix this, the \w can be replaced with [a-z\d]; it then becomes even faster: 4 matches in 72 steps and 0ms.

@WiktorStribiżew's solution is the next most efficient: 4 matches in 64 steps and 4ms.

@albina's solution is the least efficient: 7 matches in 153 steps and 10ms.

Upvotes: 1

Albina

Reputation: 1985

This regex matches keys and then values.

(?<=<ID>)(\d+)(?=<VAL>)|(?<=<VAL>)[a-z\d]*(?=<ID>)

The pattern consists of 2 alternatives, separated by |:

  • (?<=<ID>)(\d+)(?=<VAL>) matches a key \d+ between <ID> and <VAL> using positive lookbehind and lookahead
    • (?<=<ID>) is a positive lookbehind
    • (?=<VAL>) is a positive lookahead
  • (?<=<VAL>)[a-z\d]*(?=<ID>) matches a value between <VAL> and <ID> using positive lookbehind and lookahead
    • [a-z\d]* matches a value
    • (?<=<VAL>) is a positive lookbehind
    • (?=<ID>) is a positive lookahead
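As a rough illustration, the pattern can be tried out in Python like this. This is a minimal sketch, not a definitive implementation: the use of Python's re module, the variable names, and the adjusted variant with (?=<ID>|$) are assumptions added here. The variant is included because the lookahead (?=<ID>) requires another <ID> after each value, so the last value in the sample input is not matched by the original pattern.

```python
import re

text = "<ID>0<VAL>a1b<ID>1<VAL>a2b<ID>2<VAL>a3b<ID>3<VAL>a4b"

# The pattern from this answer, unchanged.
original = re.compile(r"(?<=<ID>)(\d+)(?=<VAL>)|(?<=<VAL>)[a-z\d]*(?=<ID>)")
tokens = [m.group(0) for m in original.finditer(text)]
print(tokens)  # the final value "a4b" is absent: no <ID> follows it

# Hypothetical variant: also accept end of input after a value.
adjusted = re.compile(r"(?<=<ID>)\d+(?=<VAL>)|(?<=<VAL>)[a-z\d]*(?=<ID>|$)")
all_tokens = [m.group(0) for m in adjusted.finditer(text)]

# Keys and values alternate in the match list, so pair them up by position.
pairs = list(zip(all_tokens[0::2], all_tokens[1::2]))
print(pairs)
```

Pairing by position relies on every key being followed by exactly one value match, which holds for input shaped like the sample.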

regex101.com

Upvotes: 2
