eghetto

Reputation: 263

Regex: Lines between two Strings as separate Matches

I'm trying to extract the lines between two strings as separate matches:

START-OF-FIELDS
Line A
Line B
Line C
END-OF-FIELDS

This is my regex:

(?<=START-OF-FIELDS)(.*\n)*(?=END-OF-FIELDS)

The result is just ONE match containing all three lines. How do I get THREE matches, one per line?

Upvotes: 0

Views: 6665

Answers (3)

Woodham

Reputation: 4263

I would use a negative lookahead

^(?!START\-OF\-FIELDS|END\-OF\-FIELDS)(.*)$

You will also need the m and g modifiers (multiline and global).

Demo here http://regex101.com/r/xC7qJ2/2
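For readers who want to try this outside regex101, here is a minimal sketch in Python (my choice of language, since the question does not name one). It assumes the sample input from the question; re.MULTILINE stands in for the m modifier and findall for the g modifier.

```python
import re

# Sample input from the question (no trailing newline, to avoid an empty last match).
text = "START-OF-FIELDS\nLine A\nLine B\nLine C\nEND-OF-FIELDS"

# The negative lookahead rejects any line that starts with one of the markers.
pattern = re.compile(r"^(?!START-OF-FIELDS|END-OF-FIELDS)(.*)$", re.MULTILINE)

# findall returns the content of group 1 for every matching line.
lines = pattern.findall(text)
print(lines)
```

Note that this keeps every line that is not a marker, wherever it sits in the input.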

Edit:

Amendment: I also have text before START-OF-FIELDS as well as text after END-OF-FIELDS. In this case, I get too many matches. The matches must be between those two strings!

Ah, fair enough. In that case, for completeness' sake, I would personally just use a pattern like this (?:START\-OF\-FIELDS)\n(.*)\n(?:END\-OF\-FIELDS) with the modifiers mgs, and then split the single capture on the newline character in code.
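A rough sketch of that two-step idea in Python, under the assumption that the markers sit on their own lines and the line endings are plain \n. The function name fields_between and the sample text are made up for illustration; re.DOTALL is Python's equivalent of the s modifier.

```python
import re

# Hypothetical input with unrelated text before and after the block.
text = (
    "header text\n"
    "START-OF-FIELDS\n"
    "Line A\nLine B\nLine C\n"
    "END-OF-FIELDS\n"
    "footer text\n"
)

def fields_between(text):
    """Capture everything between the markers, then split it into lines."""
    # With DOTALL, "." also matches newlines, so group 1 spans all the lines.
    m = re.search(r"(?:START-OF-FIELDS)\n(.*)\n(?:END-OF-FIELDS)", text, re.DOTALL)
    if m is None:
        return []
    return m.group(1).split("\n")

print(fields_between(text))
```

Because (.*) is greedy, an input with several blocks would be captured up to the last END-OF-FIELDS; a lazy (.*?) would stop at the first one instead.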

Upvotes: 1

Casimir et Hippolyte

Reputation: 89547

With .NET you can use this pattern in a global search, with the multiline option:

@"(?:\G(?!\A)|START-OF-FIELDS)\r?\n(.*)(?>\r?\nEND-OF-FIELD(?=S\r?$))?"

The result is in capture group 1.

The pattern works with 2 entry points. The first one is "START-OF-FIELDS" that is used for the first result. The second is \G(?!\A) that is used for other results.

\G is an anchor for the position in the string after the last match. At the beginning, \G is initialized to the start of the string. To avoid this special case, I added (?!\A) to be sure that this branch fails at the first position.

With \G, only contiguous matches are allowed after the first result.

To break the contiguity, I added an optional non-capturing group that matches "END-OF-FIELDS" but without the last character.

You can see a demo here.
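Python's built-in re module has no \G anchor, so the following is only an emulation of the same contiguity idea, not the pattern above: Pattern.match(text, pos) is anchored at pos, which plays the role of \G. The function name contiguous_fields and the three helper patterns are my own, for illustration.

```python
import re

START = re.compile(r"START-OF-FIELDS\r?\n")
LINE = re.compile(r"([^\r\n]*)\r?\n")
END = re.compile(r"END-OF-FIELDS\r?$", re.MULTILINE)

def contiguous_fields(text):
    """Collect the lines that directly follow START-OF-FIELDS, one by one."""
    start = START.search(text)        # first entry point: the opening marker
    if start is None:
        return []
    pos = start.end()
    fields = []
    while True:
        if END.match(text, pos):      # closing marker: contiguity is broken
            break
        line = LINE.match(text, pos)  # anchored at pos, like \G
        if line is None:              # ran out of complete lines
            break
        fields.append(line.group(1))
        pos = line.end()              # the next match must start exactly here
    return fields

sample = "before\nSTART-OF-FIELDS\nLine A\nLine B\nLine C\nEND-OF-FIELDS\nafter\n"
print(contiguous_fields(sample))
```

As with \G, each line is accepted only if it starts exactly where the previous one ended, so text before the opening marker or after the closing marker is never returned.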

Another way is possible with C#, since it lets you extract everything that has been matched by a repeated capturing group:

With this pattern:

string pattern = @"START-OF-FIELDS\r?\n(?>(.*)\r?\n)*?(?>END-OF-FIELD(?=S\r?$))";

Match match = Regex.Match(input, pattern, RegexOptions.Multiline);

if (match.Success) {
    foreach (Capture capture in match.Groups[1].Captures) {
        Console.WriteLine(capture.Value);
    }
}

The advantage of this approach is that the search stops as soon as the fields are found.

Upvotes: 1

ghoti

Reputation: 46826

The answer to your question is "no".

Here's why.

The regex you offered was this:

(?<=START-OF-FIELDS)(.*\n)*(?=END-OF-FIELDS)

Note that there are THREE bracketed subexpressions here. Two of them are lookarounds, but between the lookarounds is one bracketed subexpression.

Your (.*\n)* matches one line per repetition, but the bracketed subexpression exists only once in the pattern, so it can only report one value. In most engines each repetition overwrites the previous one, and $1 (or \1 or whatever) ends up holding just the last line matched. If you didn't already have that bracketed subexpression, you wouldn't have something to repeat. The individual repeats aren't returned as separate results because they aren't inside their own sets of brackets.
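One way to check what a repeated group actually returns is to run it. This is a small sketch in Python's re module (an assumption on my part, since the question names no engine), using the sample input from the question. In this engine the whole match covers every repetition, while group 1 reports only the last one.

```python
import re

text = "START-OF-FIELDS\nLine A\nLine B\nLine C\nEND-OF-FIELDS"

m = re.search(r"(?<=START-OF-FIELDS)(.*\n)*(?=END-OF-FIELDS)", text)

whole = m.group(0)  # everything consumed by all repetitions of the group
last = m.group(1)   # only the final repetition is kept in group 1

print(repr(whole))
print(repr(last))
```

The whole match also begins with the newline that ends the START-OF-FIELDS line, because the lookbehind stops right before it.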

I see two ways to get around this.

First way would be to put the entire matching text into a separate string, like:

(?<=START-OF-FIELDS)((.*\n)*)(?=END-OF-FIELDS)

Now you've got the repeated text in $1, and you can split by newline.

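Second way would only work if you know that you have exactly three lines. The lookbehind has to include the newline after START-OF-FIELDS, otherwise the first group only matches the empty remainder of that line and the pattern fails. That would be:

(?<=START-OF-FIELDS\n)(.*\n)(.*\n)(.*\n)(?=END-OF-FIELDS)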

Now you've got multiple subexpressions, one for each line.
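A short Python sketch of the fixed-count variant, assuming plain \n line endings and a lookbehind that includes the newline after START-OF-FIELDS, so that the first group starts on the first field line. The four-line input is made up to show the limitation.

```python
import re

PATTERN = r"(?<=START-OF-FIELDS\n)(.*\n)(.*\n)(.*\n)(?=END-OF-FIELDS)"

text = "START-OF-FIELDS\nLine A\nLine B\nLine C\nEND-OF-FIELDS"
m = re.search(PATTERN, text)

# One group per line; strip the newline that each group carries with it.
three = [g.rstrip("\n") for g in m.groups()]
print(three)

# With any other number of lines, the pattern does not match at all.
four_lines = "START-OF-FIELDS\nA\nB\nC\nD\nEND-OF-FIELDS"
no_match = re.search(PATTERN, four_lines)
print(no_match)
```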

Neither of these does exactly what you want, hence my initial answer of "no". :-)

Upvotes: 0
