Regex with fixed start and end and repetitive groups inside

Question

VB2012: I have a string I want to parse out. It has a fixed start and end string but inside there are repetitive strings.

Input string looks like this with much more of the same type of data between START and END.

START;data[0][1]="2000";data[0][2]="2015-09-25";data[0][3]="XYZ";END;

My current regex looks like this

(data\[(?<row>\d{1,2})]\[(?<col>\d{1,2})]="(?<val>.*?)";)

That works great and matches the repetitive strings inside:

Match Number    Match Text                      Group 1                             row col val
0               "data[0][1]=""2000"";"          "data[0][1]=""2000"";"              "0" "1" "2000"
1               "data[0][2]=""2015-09-25"";"    "data[0][2]=""2015-09-25"";"        "0" "2" "2015-09-25"
2               "data[0][3]=""XYZ"";"           "data[0][3]=""XYZ"";"               "0" "3" "XYZ"

I want to make the match a bit more accurate by matching the START string, then the repetitive strings, then the END string. My attempt has been of the form:

START;(data\[(?<row>\d{1,2})]\[(?<col>\d{1,2})]="(?<val>.*?)";)*END;

But that gives me an output where the different groups are on their own and not part of a bigger match. I'm stuck on what I should try.

Lucas Trzesniewski · Accepted Answer

Let's take your example:

START;data[0][1]="2000";data[0][2]="2015-09-25";data[0][3]="XYZ";END;

along with your second regex:

START;(data\[(?<row>\d{1,2})]\[(?<col>\d{1,2})]="(?<val>.*?)";)*END;

So, what do we get here?

The pattern is wrapped in START;(...[values]...)*END;, and you're using a * quantifier. There are further capture groups in the [values] part.

So, a match looks like this:

START;data[0][1]="2000";data[0][2]="2015-09-25";data[0][3]="XYZ";END;
           R  C   VVVV       R  C   VVVVVVVVVV       R  C   VVV        <-- groups
      \________________/\______________________/\_______________/      <-- [values]
\___________________________________________________________________/  <-- full match

The [values] part of the regex matches 3 times. R is the value captured by the row group, C is what's captured by col, and VVV is what's captured by val.

In such a case, most other regex engines would throw away all but the last capture, and you'd get only the values 0, 3 and XYZ from your match.

But .NET supports multiple captures per group. So you can extract all the captured substrings, for each iteration of the enclosing * quantifier.

Each item in Match.Groups corresponds to a capture group in the pattern (e.g. the (?<row>...) group).
Each item in Match.Groups("row").Captures corresponds to a given capture in an iteration of a quantifier during the match.

Which means, when a given capture group is used several times during a match, you'll get several captures for it.
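
As a minimal sketch of reading those captures in VB.NET, something like the following should work. The module name CapturesDemo and the helper ExtractCells are made up for illustration; the regex is the second pattern from the question, and the only library calls used are Regex.Match, Match.Groups and Group.Captures.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module CapturesDemo
    ' Second pattern from the question, with the named groups row, col and val.
    Const Pattern As String = "START;(data\[(?<row>\d{1,2})]\[(?<col>\d{1,2})]=""(?<val>.*?)"";)*END;"

    ' Hypothetical helper: returns one "row,col,val" string per iteration of the * quantifier.
    Function ExtractCells(input As String) As List(Of String)
        Dim cells As New List(Of String)
        Dim m As Match = Regex.Match(input, Pattern)
        If Not m.Success Then Return cells

        Dim rows As CaptureCollection = m.Groups("row").Captures
        Dim cols As CaptureCollection = m.Groups("col").Captures
        Dim vals As CaptureCollection = m.Groups("val").Captures

        ' All three groups live inside the same repeated group, so capture i
        ' of each one comes from iteration i of the quantifier.
        For i As Integer = 0 To rows.Count - 1
            cells.Add(rows(i).Value & "," & cols(i).Value & "," & vals(i).Value)
        Next
        Return cells
    End Function

    Sub Main()
        Dim input As String = "START;data[0][1]=""2000"";data[0][2]=""2015-09-25"";data[0][3]=""XYZ"";END;"
        For Each cell As String In ExtractCells(input)
            Console.WriteLine(cell)
        Next
        ' Prints:
        ' 0,1,2000
        ' 0,2,2015-09-25
        ' 0,3,XYZ
    End Sub
End Module
```

Note that m.Groups("val").Value on its own returns only the last capture ("XYZ" here), which is the behavior other engines are limited to; the Captures collection is what gives access to every iteration.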

Contrast it with the first regex:

(data\[(?<row>\d{1,2})]\[(?<col>\d{1,2})]="(?<val>.*?)";)

Let's look at the matches:

START;data[0][1]="2000";data[0][2]="2015-09-25";data[0][3]="XYZ";END;
           R  C   VVVV       R  C   VVVVVVVVVV       R  C   VVV        <-- groups
      \________________/\______________________/\_______________/      <-- whole matches

Each match has only one capture instance for each capturing group.
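
For comparison, here is a rough sketch of walking the matches of the first regex with Regex.Matches. The names MatchesDemo and ExtractMatches are hypothetical; since each group captures exactly once per match, reading Group.Value is enough here.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module MatchesDemo
    ' First pattern from the question: one match per data[...][...]="..."; item.
    Const ItemPattern As String = "(data\[(?<row>\d{1,2})]\[(?<col>\d{1,2})]=""(?<val>.*?)"";)"

    ' Hypothetical helper: returns one "row,col,val" string per match.
    Function ExtractMatches(input As String) As List(Of String)
        Dim cells As New List(Of String)
        For Each m As Match In Regex.Matches(input, ItemPattern)
            ' Here m.Groups("row").Captures.Count is 1 for every match,
            ' so Value and Captures(0).Value are the same string.
            cells.Add(m.Groups("row").Value & "," & m.Groups("col").Value & "," & m.Groups("val").Value)
        Next
        Return cells
    End Function

    Sub Main()
        Dim input As String = "START;data[0][1]=""2000"";data[0][2]=""2015-09-25"";data[0][3]=""XYZ"";END;"
        For Each cell As String In ExtractMatches(input)
            Console.WriteLine(cell)
        Next
    End Sub
End Module
```

The trade-off is that this version never checks for START; and END;, so it would also pick up items from input that lacks them.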
