matthias c

Reputation: 37

Capture group in negative lookahead ignored by regex

In a long text file there are several headers.
I need to split this file so that I have each header and content separately.
Consecutive repeated headers should be treated as one. Minimal example:

HeaderA
example text

HeaderB
example text

HeaderC
example text

HeaderC
example text

HeaderD
example text

I managed this in Python with the following regular expression:

Header(\w)[\s\S]*?(?=Header(?!\1)|$)

Note that both HeaderC sections are captured as a single match.

DEMO
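
For reference, the first case can be reproduced with a minimal Python sketch. The variable names are only illustrative, and the sample text is the one shown above.

```python
import re

text = """HeaderA
example text

HeaderB
example text

HeaderC
example text

HeaderC
example text

HeaderD
example text
"""

# Group 1 holds the letter after "Header". The lookahead stops the lazy
# match at the next Header whose letter differs, or at the end of the text.
pattern = re.compile(r"Header(\w)[\s\S]*?(?=Header(?!\1)|$)")

# Pair each captured letter with the full text of its match.
sections = [(m.group(1), m.group(0)) for m in pattern.finditer(text)]

for letter, body in sections:
    # Print the letter and how many Header lines ended up in that match.
    print(letter, body.count("Header"))
```

The two HeaderC sections end up in the same match, so the third entry contains two Header lines.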

Now I received a slightly altered text file that looks like this:

Header
foo
bar ID 1
foo

Header
foo 
bar ID 2
foo

Header
foo 
bar ID 3
foo

Header
foo 
bar ID 3
foo

Header
foo 
bar ID 4
foo

I tried altering my original expression for this example into this:

Header[\s\S]*?ID\s(\d)[\s\S]*?(?=Header[\s\S]*?ID\s(?!\1)|$)

DEMO2

However, I can't get this to work using the same technique as in my first example.

My desired result is that

Header
foo 
bar ID 3
foo

Header
foo 
bar ID 3
foo

is one match, similar to the first demo. Any explanation as to what I'm missing would be greatly appreciated. A working counterexample would also be accepted.

Thank you in advance

Upvotes: 1

Views: 52

Answers (1)

Wiktor Stribiżew

Reputation: 627219

You can use either of these equivalent patterns:

(?s)Header.*?ID\s(\d).*?(?=Header(?!.*?ID\s(?=\1))|$)
Header[\w\W]*?ID\s(\d)[\w\W]*?(?=Header(?![\w\W]*?ID\s(?=\1))|$)

See the regex demo. The point here is to stop at a Header that is not followed, after any amount of text, by ID, a whitespace character, and the same number as in Group 1. As a result, the lazy .*? / [\w\W]*? / [\s\S]*? keeps expanding until it reaches the Header of a section with a different number.
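
To see the difference in Python, here is a small sketch that runs the attempted pattern from the question and the first pattern above on the sample text. The helper name ids_per_match is made up for this illustration; it lists the ID digits found inside each match.

```python
import re

text = """Header
foo
bar ID 1
foo

Header
foo
bar ID 2
foo

Header
foo
bar ID 3
foo

Header
foo
bar ID 3
foo

Header
foo
bar ID 4
foo
"""

# Pattern from the question: the lookahead's [\s\S]*? may skip ahead to a later ID.
attempt = re.compile(r"Header[\s\S]*?ID\s(\d)[\s\S]*?(?=Header[\s\S]*?ID\s(?!\1)|$)")

# Pattern from this answer: reject a Header that is still followed by the same ID.
fixed = re.compile(r"(?s)Header.*?ID\s(\d).*?(?=Header(?!.*?ID\s(?=\1))|$)")


def ids_per_match(pattern, s):
    """Return the list of ID digits contained in each match (illustrative helper)."""
    return [re.findall(r"ID\s(\d)", m.group(0)) for m in pattern.finditer(s)]


print(ids_per_match(attempt, text))  # five matches, the two ID 3 sections split
print(ids_per_match(fixed, text))    # four matches, the two ID 3 sections merged
```

One thing to check against real data: the negative lookahead scans to the end of the text, so an ID that reappears later in a non-adjacent section would also be merged into the same match.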

Note: (?s) is the inline DOTALL modifier, which lets . match line break characters as well.
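
As a quick illustration of that flag, the sketch below (with a made-up two-line sample string) shows that the inline (?s) and the re.DOTALL argument behave the same, while the pattern without either cannot cross the line break.

```python
import re

sample = "Header\nbar ID 1"

# Without DOTALL, . stops at the newline, so the pattern cannot reach "ID".
without_flag = re.search(r"Header.*?ID", sample)

# Inline modifier at the start of the pattern.
inline_flag = re.search(r"(?s)Header.*?ID", sample)

# Same effect through the flags argument.
flag_argument = re.search(r"Header.*?ID", sample, flags=re.DOTALL)

print(without_flag)
print(inline_flag.group(0) == flag_argument.group(0))
```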

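Here is how your regex matches the first ID 3 section: inside your lookahead, [\s\S]*? is free to skip past the ID 3 of the next section and continue to ID 4, where (?!\1) succeeds. The lookahead therefore already succeeds at the very next Header, and the match stops there.

Here is how my regex matches the ID 3 sections: the negative lookahead wraps the whole "any text, ID, whitespace, same number" check, so the Header that starts the second ID 3 section is rejected as an end point and the match runs on to the Header of the ID 4 section.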

Upvotes: 1
