Regular expression - text between multiple occurrences of the same pattern

Question

I need to parse a large number of files and process some content based on certain tokens. In order to do this I have to take each token and the text after it, until the next token (with extra new lines).

A ---
some text of many lines
B --- 

other text with some lines

C --- 
more text and tokens and text

I've been using regex101 and made it up to splitting them

(?<token>^([a-zA-Z].--.*))|(?<content>.*)

However, I can't get the second match into a single group. The desired result is a list of pairs, each made of a token and the text that follows it.

Is it possible to accomplish this with a single regular expression, and if so, how?

Thanks

Wiktor Stribiżew · Accepted Answer

Let's assume your token pattern is correct and matches all you need. Then the content is everything after a token match up to the next occurrence of the token pattern, that is, ^[a-zA-Z].--.*: start of the line (^), an ASCII letter ([a-zA-Z]), any char but a newline (.), two hyphens (--) and then any 0+ chars, as many as possible, up to the end of the line (note that in .NET regex, . also matches the CR, "\r", symbol).

If your files are not that big, you could use

@"(?m)^(?<token>[a-zA-Z].--.*)(?<content>(?:\r?\n(?![a-zA-Z].---).*)*)"

See the regex demo. This regex accounts for the cases when the token has no content, and also excludes matching the token in the middle of some content.

From the structural point of view, the pattern is equivalent to (?m)^(?<token>[a-zA-Z].--.*)(?<content>(?s:.*?))(?=^[a-zA-Z].---|\z), but it is more efficient: the lazy dot matching pattern constrained with a lookahead having two alternatives makes the regex engine work hard, since the lookahead is checked at each char in the input string. An unrolled pattern like the one I suggest grabs whole lines that do not start with the token at once, and thus works much quicker.
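
To make the comparison concrete, here is a minimal sketch that runs both patterns over a small input and checks that they yield the same pairs. The sample string and the helper name Pairs are invented for this illustration. Note that the raw content values differ by a trailing line break (the lazy version stops right before the next token line, the unrolled one stops before the line break that precedes it), so the comparison is done on trimmed values, as in the demo further down.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Sample input invented for this sketch (LF line endings only).
var input = "A ---\nfirst\nsecond\nB ---\n\nthird\nC ---\n";

// Unrolled version: consumes whole non-token lines at a time.
var unrolled = @"(?m)^(?<token>[a-zA-Z].--.*)(?<content>(?:\r?\n(?![a-zA-Z].---).*)*)";
// Lazy-dot version: the lookahead is tried before every consumed char.
var lazy = @"(?m)^(?<token>[a-zA-Z].--.*)(?<content>(?s:.*?))(?=^[a-zA-Z].---|\z)";

// Hypothetical helper: returns "token|content" strings with whitespace trimmed.
string[] Pairs(string pattern) =>
    Regex.Matches(input, pattern)
         .Cast<Match>()
         .Select(m => m.Groups["token"].Value.Trim() + "|" + m.Groups["content"].Value.Trim())
         .ToArray();

var a = Pairs(unrolled);
var b = Pairs(lazy);

Console.WriteLine(a.SequenceEqual(b)); // True
foreach (var p in a)
    Console.WriteLine(p.Replace("\n", "\\n")); // show line breaks inside content as \n
```

This only shows that the two patterns agree on this input; it does not measure the speed difference, which would need a benchmark on realistically sized files.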

Details:

(?m) - same as RegexOptions.Multiline: ^ now matches at the start of each line (and $ matches at the end of each line, not only at the end of the whole string)
^ - start of the line
(?<token>[a-zA-Z].--.*) - "token" group:
- [a-zA-Z] - an ASCII letter
- . - any char but a newline (it also matches CR; use [^\r\n] to only match a char that is not a part of the CRLF ending)
- -- - two hyphens
- .* - any 0+ chars other than a newline, as many as possible, up to the end of the line (note the . matches CR in .NET regex)
(?<content>(?:\r?\n(?![a-zA-Z].---).*)*) - "content" group:
- (?:\r?\n(?![a-zA-Z].---).*)* - zero or more sequences of:
  - \r?\n(?![a-zA-Z].---) - a CRLF or an LF line end not followed with the token pattern
  - .* - any 0+ chars other than a newline, as many as possible, up to the end of the line
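
The CR remark above is easy to overlook, so here is a small sketch of what it means in practice on input with Windows (CRLF) line endings. The sample string is invented, and strictPattern is only one possible variant (an assumption, not part of the pattern suggested above) in which every . is swapped for [^\r\n].

```csharp
using System;
using System.Text.RegularExpressions;

// Sample input invented for this sketch, with CRLF line endings.
var crlf = "A ---\r\nsome text\r\nB ---\r\n";

// Pattern as suggested above: "." is allowed to match "\r".
var dotPattern = @"(?m)^(?<token>[a-zA-Z].--.*)(?<content>(?:\r?\n(?![a-zA-Z].---).*)*)";
// Assumed variant: "." replaced with [^\r\n], so CR is never captured at a line end.
var strictPattern = @"(?m)^(?<token>[a-zA-Z][^\r\n]--[^\r\n]*)(?<content>(?:\r?\n(?![a-zA-Z][^\r\n]---)[^\r\n]*)*)";

var m1 = Regex.Match(crlf, dotPattern);
var m2 = Regex.Match(crlf, strictPattern);

// With ".", the token keeps the trailing CR of its line.
Console.WriteLine(m1.Groups["token"].Value.EndsWith("\r", StringComparison.Ordinal)); // True
// With [^\r\n], the token stops before the CR.
Console.WriteLine(m2.Groups["token"].Value.EndsWith("\r", StringComparison.Ordinal)); // False
// The CRLF is then consumed as a whole by \r?\n inside the content group.
Console.WriteLine(m2.Groups["content"].Value == "\r\nsome text"); // True
```

If the captured values are trimmed anyway, as in the demo below, the stray CR does no harm and the shorter . version is fine.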

C# demo (note I am trimming both the group values to get rid of leading/trailing whitespace):

using System;
using System.Linq;
using System.Text.RegularExpressions;

// Sample input; \n marks the line breaks
var s = "A ---\nsome text of many lines\nB ---\n\nother text with some lines\nand text and\ntext \n\nC --- \nmore text and tokens and text\n\nQQ--- \n\nmore text more text\n\nHH---\nJJ---\n";
var pat = @"^(?<token>[a-zA-Z].--.*)(?<content>(?:\r?\n(?![a-zA-Z].---).*)*)";
// Collect trimmed { token, content } pairs from all matches
var result = Regex.Matches(s, pat, RegexOptions.Multiline)
        .Cast<Match>()
        .Select(m => new[] {m.Groups["token"].Value.Trim(), m.Groups["content"].Value.Trim()});
foreach (var pair in result)
    Console.WriteLine($"--- New match ---\nToken: {pair[0]}\nContent: {pair[1]}");

Output:

--- New match ---
Token: A ---
Content: some text of many lines
--- New match ---
Token: B ---
Content: other text with some lines
and text and
text
--- New match ---
Token: C ---
Content: more text and tokens and text
--- New match ---
Token: QQ---
Content: more text more text
--- New match ---
Token: HH---
Content: 
--- New match ---
Token: JJ---
Content:
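
Since the question is about processing files, the same pattern can be wrapped in a small helper. This is only a sketch under the "files are not that big" assumption above, because the whole file is read into memory; the helper name ParseFile and the temporary file are made up for the example.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

// Hypothetical helper: returns trimmed (token, content) pairs for one file.
List<(string Token, string Content)> ParseFile(string path)
{
    var pattern = @"^(?<token>[a-zA-Z].--.*)(?<content>(?:\r?\n(?![a-zA-Z].---).*)*)";
    var pairs = new List<(string Token, string Content)>();
    // Reads the whole file at once, so this suits small and medium files only.
    foreach (Match m in Regex.Matches(File.ReadAllText(path), pattern, RegexOptions.Multiline))
        pairs.Add((m.Groups["token"].Value.Trim(), m.Groups["content"].Value.Trim()));
    return pairs;
}

// Usage with a throwaway file holding invented sample content.
var path = Path.GetTempFileName();
File.WriteAllText(path, "A ---\nsome text\nB ---\n\nother text\n");
var parsed = ParseFile(path);
File.Delete(path);

foreach (var (token, content) in parsed)
    Console.WriteLine($"{token} => {content}");
// A --- => some text
// B --- => other text
```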
