Reputation: 37
In a long text file there are several headers.
I need to split this file so that I have each header and content separately.
Repeated headers are to be considered as one.
Minimal example:
HeaderA
example text
HeaderB
example text
HeaderC
example text
HeaderC
example text
HeaderD
example text
Using this regular expression in Python, I have managed to do that:
Header(\w)[\s\S]*?(?=Header(?!\1)|$)
Note that both HeaderC sections are captured together as one match.
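For reference, a minimal sketch of how this expression behaves with re.finditer, assuming the sample above is held in a string. The names text, pattern, letters and sections are illustrative only and do not come from the original post.

```python
import re

# The minimal example from the question, as one string.
text = """HeaderA
example text
HeaderB
example text
HeaderC
example text
HeaderC
example text
HeaderD
example text
"""

# Group 1 captures the header letter; the lazy [\s\S]*? stops at the next
# "Header" that is NOT followed by that same letter, or at the end of the text.
pattern = re.compile(r"Header(\w)[\s\S]*?(?=Header(?!\1)|$)")

matches = list(pattern.finditer(text))
letters = [m.group(1) for m in matches]
sections = [m.group(0) for m in matches]

for letter, section in zip(letters, sections):
    print(f"--- Header{letter} ---")
    print(section.rstrip())
```

The two HeaderC blocks end up in a single entry of sections, because the lookahead rejects a following HeaderC as a stopping point.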
Now I received a slightly altered text file that looks like this:
Header
foo
bar ID 1
foo
Header
foo
bar ID 2
foo
Header
foo
bar ID 3
foo
Header
foo
bar ID 3
foo
Header
foo
bar ID 4
foo
I tried altering my original expression for this example into this:
Header[\s\S]*?ID\s(\d)[\s\S]*?(?=Header[\s\S]*?ID\s(?!\1)|$)
However, I can't get this to work using the same technique as in my first example.
My desired result is that
Header
foo
bar ID 3
foo
Header
foo
bar ID 3
foo
is one match, similar to the first demo. Any explanation as to what I'm missing would be greatly appreciated. A working counterexample would also be accepted.
Thank you in advance
Upvotes: 1
Views: 52
Reputation: 627219
You can use either of these:
(?s)Header.*?ID\s(\d).*?(?=Header(?!.*?ID\s(?=\1))|$)
Header[\w\W]*?ID\s(\d)[\w\W]*?(?=Header(?![\w\W]*?ID\s(?=\1))|$)
See the regex demo. The point here is to match up to a Header string that is not followed by any text, then ID, whitespace, and the same number as in Group 1. In this case, the .*? (or [\w\W]*? / [\s\S]*?) gets expanded until the Header that contains a different number.
Note: (?s) is a DOTALL inline modifier that lets . match line break chars.
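A short sketch of the first pattern applied in Python, assuming the altered sample from the question is rebuilt as a string. The names ids, text, pattern, found_ids and sections are illustrative and not part of the original answer.

```python
import re

# Rebuild the altered sample from the question: five blocks with IDs 1, 2, 3, 3, 4.
ids = [1, 2, 3, 3, 4]
text = "".join(f"Header\nfoo\nbar ID {i}\nfoo\n" for i in ids)

# (?s) makes "." match newlines. The negative lookahead after "Header" rejects
# any Header that is still followed, somewhere later, by "ID <same digit>".
pattern = re.compile(r"(?s)Header.*?ID\s(\d).*?(?=Header(?!.*?ID\s(?=\1))|$)")

matches = list(pattern.finditer(text))
found_ids = [m.group(1) for m in matches]
sections = [m.group(0) for m in matches]

for found_id, section in zip(found_ids, sections):
    print(f"--- ID {found_id} ---")
    print(section.rstrip())
```

Both ID 3 blocks come back as one match. Note that (\d) captures a single digit and the negative lookahead scans the whole rest of the text, so this sketch relies on IDs being single-digit and on equal IDs being adjacent, as they are in the sample.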
Here is how your regex matches first Header 3:
Here is my regex Header 3 match:
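The difference between the two lookaheads can also be checked in isolation. The sketch below is only an illustration: it hard-codes the digit 3 in place of the \1 backreference and tests each lookahead at the start of the second ID 3 block. The names text, pos, attempt_lookahead and answer_lookahead are assumptions made for this example.

```python
import re

# Same altered sample as in the question: blocks with IDs 1, 2, 3, 3, 4.
ids = [1, 2, 3, 3, 4]
text = "".join(f"Header\nfoo\nbar ID {i}\nfoo\n" for i in ids)

# Offset of the 4th "Header", i.e. the start of the second "ID 3" block.
pos = [m.start() for m in re.finditer("Header", text)][3]

# Lookahead from the question's attempt, with \1 replaced by the literal 3.
# The lazy [\s\S]*? may skip past "ID 3" and settle on the later "ID 4".
attempt_lookahead = re.compile(r"(?=Header[\s\S]*?ID\s(?!3))")

# Lookahead from the answer, again with \1 replaced by 3.
# It fails as long as any "ID 3" still follows this Header.
answer_lookahead = re.compile(r"(?=Header(?![\s\S]*?ID\s(?=3)))")

attempt_stops_here = attempt_lookahead.match(text, pos) is not None
answer_stops_here = answer_lookahead.match(text, pos) is not None

print("attempt treats the second ID 3 Header as a boundary:", attempt_stops_here)
print("answer treats the second ID 3 Header as a boundary:", answer_stops_here)
```

In the attempt, the negated check sits at the end of a lazy scan, so the engine can backtrack to a later ID that differs and the lookahead succeeds at every following Header. Moving the negation directly after Header turns the condition into "no matching ID follows", which backtracking cannot bypass.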
Upvotes: 1