Reputation: 587
I need to parse a large number of files and process some content based on certain tokens. In order to do this I have to take each token and the text after it, until the next token (with extra new lines).
A ---
some text of many lines
B ---
other text with some lines
C ---
more text and tokens and text
I've been using regex101 and got as far as splitting the tokens from the rest:
(?<token>^([a-zA-Z].--.*))|(?<content>.*)
However, I can't get the text that follows a token into a single group. The desired result is a list of pairs, each made of a token and the text that follows it.
Is it possible to accomplish this with a single regular expression, and if so, how?
Thanks
Upvotes: 3
Views: 125
Reputation: 626927
Let's assume your token pattern is correct and matches all you need. Then, the content is everything after the token pattern up to the first occurrence of the token pattern, that is ^[a-zA-Z].--.*: start of the line (^), an ASCII letter ([a-zA-Z]), any char but a newline (.), two hyphens (--) and then any 0+ chars, as many as possible, up to the end of the line (note, in .NET regex, . also matches the CR "\r" symbol).
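To make the CR note concrete, here is a minimal sketch, assuming CRLF input. The class and method names (CrDemo, FirstToken) are made up for illustration. It compares the token pattern as written with a variant that uses [^\r\n] in place of the dot.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CrDemo
{
    // Hypothetical helper: returns the first match of the pattern in Multiline mode.
    public static string FirstToken(string input, string pattern) =>
        Regex.Match(input, pattern, RegexOptions.Multiline).Value;

    public static void Main()
    {
        var input = "A ---\r\nsome text\r\n";

        // In .NET, '.' matches any char except "\n", so ".*" also takes the "\r".
        var withDot = FirstToken(input, @"^[a-zA-Z].--.*");
        // [^\r\n] stops before either line-ending char.
        var withClass = FirstToken(input, @"^[a-zA-Z][^\r\n]--[^\r\n]*");

        Console.WriteLine(withDot.EndsWith("\r", StringComparison.Ordinal));   // True
        Console.WriteLine(withClass.EndsWith("\r", StringComparison.Ordinal)); // False
    }
}
```

Trimming the group value, as done in the demo further down, is the other way to get rid of the stray CR.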
If your files are not that big, you could use
@"(?m)^(?<token>[a-zA-Z].--.*)(?<content>(?:\r?\n(?![a-zA-Z].---).*)*)"
See the regex demo. This regex accounts for the cases when the token has no content, and also excludes matching the token in the middle of some content.
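A small sketch of those two cases, assuming LF-only input for brevity. EdgeCaseDemo and Parse are illustrative names, not part of any library. The first token has no content at all, and the content of the second one mentions something that looks like a token in the middle of a line.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class EdgeCaseDemo
{
    const string Pattern =
        @"^(?<token>[a-zA-Z].--.*)(?<content>(?:\r?\n(?![a-zA-Z].---).*)*)";

    // Hypothetical helper: collects trimmed (token, content) pairs.
    public static List<(string Token, string Content)> Parse(string input)
    {
        var pairs = new List<(string Token, string Content)>();
        foreach (Match m in Regex.Matches(input, Pattern, RegexOptions.Multiline))
            pairs.Add((m.Groups["token"].Value.Trim(), m.Groups["content"].Value.Trim()));
        return pairs;
    }

    public static void Main()
    {
        // "HH---" has no content; "see B --- below" has a token-like string mid-line.
        var input = "HH---\nJJ---\nsee B --- below\n";
        foreach (var (token, content) in Parse(input))
            Console.WriteLine($"[{token}] => [{content}]");
        // [HH---] => []
        // [JJ---] => [see B --- below]
    }
}
```

The mid-line "B ---" stays in the content because both the token pattern and the lookahead are only checked right at a line start.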
From the structural point of view, the pattern is equal to (?m)^(?<token>[a-zA-Z].--.*)(?<content>(?s:.*?))(?=^[a-zA-Z].---|\z), but is a more efficient version, since the lazy dot matching pattern constrained with a lookahead having two alternatives makes the regex engine work hard when matching each char in the input string. An unrolled pattern like the one I suggest will grab whole lines that do not start with the token at once, and thus it will work much quicker.
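As a rough sanity check of that equivalence, the sketch below runs both patterns over one sample and compares the trimmed pairs. PatternComparison and Pairs are made-up names, and the sample string is an assumption. It only compares results and does not measure speed.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class PatternComparison
{
    // Unrolled version: consumes whole non-token lines at once.
    public const string Unrolled =
        @"(?m)^(?<token>[a-zA-Z].--.*)(?<content>(?:\r?\n(?![a-zA-Z].---).*)*)";
    // Lazy-dot version: checks the lookahead at every position.
    public const string Lazy =
        @"(?m)^(?<token>[a-zA-Z].--.*)(?<content>(?s:.*?))(?=^[a-zA-Z].---|\z)";

    // Hypothetical helper: returns "token|content" strings with whitespace trimmed.
    public static string[] Pairs(string input, string pattern) =>
        Regex.Matches(input, pattern)
             .Cast<Match>()
             .Select(m => m.Groups["token"].Value.Trim() + "|" + m.Groups["content"].Value.Trim())
             .ToArray();

    public static void Main()
    {
        var input = "A ---\r\nsome text\r\nmore\r\nB ---\r\n\r\nother text\r\nHH---\r\n";
        // The raw content groups can differ in trailing line breaks, hence the Trim().
        Console.WriteLine(Pairs(input, Unrolled).SequenceEqual(Pairs(input, Lazy)));
    }
}
```

The raw group values are not byte-for-byte identical (the lazy version keeps the final line break before the next token), so the comparison is done on trimmed values only.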
Details:
(?m) - same as RegexOptions.Multiline, the ^ matches line start now (and $ matches the end of the line, not whole string)
^ - start of the line
(?<token>[a-zA-Z].--.*) - "token" group:
    [a-zA-Z] - an ASCII letter
    . - any char but a newline (also, matches CR, use [^\n\r] to only match a char that is not a part of the CRLF ending)
    -- - two hyphens
    .* - any 0+ chars other than a newline, as many as possible, up to the end of the line (note the . matches CR in .NET regex)
(?<content>(?:\r?\n(?![a-zA-Z].---).*)*) - "content" group:
    (?:\r?\n(?![a-zA-Z].---).*)* - zero or more sequences of:
        \r?\n(?![a-zA-Z].---) - a CRLF or an LF line end not followed with the token pattern
        .* - any 0+ chars other than a newline, as many as possible, up to the end of the line
C# demo (note I am trimming both the group values to get rid of leading/trailing whitespace):
using System;
using System.Linq;
using System.Text.RegularExpressions;

var s = "A ---\r\nsome text of many lines\r\nB ---\r\n\r\nother text with some lines\r\nand text and\r\ntext \r\n\r\nC --- \r\nmore text and tokens and text\r\n\r\nQQ--- \r\n\r\nmore text more text\r\n\r\nHH---\r\nJJ---\r\n";
var pat = @"^(?<token>[a-zA-Z].--.*)(?<content>(?:\r?\n(?![a-zA-Z].---).*)*)";
// Each match yields a trimmed [token, content] pair.
var result = Regex.Matches(s, pat, RegexOptions.Multiline)
    .Cast<Match>()
    .Select(m => new[] {m.Groups["token"].Value.Trim(), m.Groups["content"].Value.Trim()});
foreach (var pair in result)
    Console.WriteLine($"--- New match ---\nToken: {pair[0]}\nContent: {pair[1]}");
Output:
--- New match ---
Token: A ---
Content: some text of many lines
--- New match ---
Token: B ---
Content: other text with some lines
and text and
text
--- New match ---
Token: C ---
Content: more text and tokens and text
--- New match ---
Token: QQ---
Content: more text more text
--- New match ---
Token: HH---
Content:
--- New match ---
Token: JJ---
Content:
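The regex above reads the whole input as one string, which is why it comes with the "files are not that big" caveat. For larger files, one possible alternative (not part of the solution above, just a sketch under assumptions) is to walk the lines one at a time and test each line against the token pattern. StreamingTokenReader and Read are hypothetical names, and text that appears before the first token is simply skipped.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

public static class StreamingTokenReader
{
    // Same start-of-token test as the token pattern, applied to a single line.
    static readonly Regex TokenLine = new Regex(@"^[a-zA-Z].--");

    // Yields trimmed (token, content) pairs while consuming the lines lazily.
    public static IEnumerable<(string Token, string Content)> Read(IEnumerable<string> lines)
    {
        var hasToken = false;
        var token = "";
        var content = new StringBuilder();
        foreach (var line in lines)
        {
            if (TokenLine.IsMatch(line))
            {
                // A new token closes the previous pair, if there was one.
                if (hasToken) yield return (token, content.ToString().Trim());
                hasToken = true;
                token = line.Trim();
                content.Clear();
            }
            else if (hasToken)
            {
                content.AppendLine(line);
            }
        }
        if (hasToken) yield return (token, content.ToString().Trim());
    }

    public static void Main()
    {
        // In-memory sample; for a real file, File.ReadLines(path) could be passed instead.
        var lines = new[] { "A ---", "some text of many lines", "B ---", "", "other text", "HH---", "JJ---" };
        foreach (var (token, content) in Read(lines))
            Console.WriteLine($"Token: {token}\nContent: {content}");
    }
}
```

Since File.ReadLines enumerates a file lazily and strips the line endings, only the content of the current token is held in memory and the CR issue does not come up.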
Upvotes: 2
Reputation: 4930
Here's what I was able to do to make your regex work.
/(?<token>[A-Za-z]+)\s*---\s*(?<content>.+?)(?=[A-Za-z]+\s*---\s*|$)/gs
https://regex101.com/r/x8tPHN/4
The difference between what I have and what you have is that there is a lookahead that checks for either a new token OR the end of the data.
I have the g (global) and s (dot matches newline) flags enabled.
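If you want to run this from C# instead of regex101, a minimal sketch could look like the following, assuming the question's sample with LF line ends. LookaheadDemo and Parse are illustrative names. Regex.Matches takes the place of the g flag and RegexOptions.Singleline corresponds to the s flag.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaheadDemo
{
    const string Pattern =
        @"(?<token>[A-Za-z]+)\s*---\s*(?<content>.+?)(?=[A-Za-z]+\s*---\s*|$)";

    // Regex.Matches returns all matches (g flag); Singleline lets '.' match "\n" (s flag).
    public static MatchCollection Parse(string input) =>
        Regex.Matches(input, Pattern, RegexOptions.Singleline);

    public static void Main()
    {
        var input = "A ---\nsome text of many lines\nB ---\nother text with some lines\nC ---\nmore text and tokens and text";
        foreach (Match m in Parse(input))
        {
            var token = m.Groups["token"].Value;
            var content = m.Groups["content"].Value.Trim();
            Console.WriteLine($"Token: {token}\nContent: {content}");
        }
    }
}
```

Note that here the token group holds only the letters (A, B, C), without the hyphens.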
Upvotes: 2