Regular expression to get text between Indicator and Tag

Question

I have a report of errors generated by my program, and I want to build a kind of whitelist from it. To do that, I want to extract all errors that carry the tag "@Whitelist".

So the report looks like this:

...
45)
Error: some description
Signal: lorem ipsum
@Whitelist

46)
Error: other description
Signal: lorem ipsum
File:   some file

47)
Error: lorem ipsum
Project: X
@Whitelist
@Comment description why this is sent to the whitelist
...

Here I want to get errors no. 45 and 47, but not 46.

To sum up: I am looking for a regular expression that captures everything from the "Error" indicator (which can also be "Warning" or "Message"), inclusive, up to and including the "@Comment" line with its message, but only if @Whitelist is present.

There can be any number of lines between the Error indicator and @Whitelist.

I can't come up with a good solution for this problem. Can anyone help? Many thanks in advance.

Edit: I just realized that the report format may change over time; for example, a headline could be added above a group of errors. Say errors 46 and 47 have the same type: then there would be a line "File Read Errors: " above error 46. That's why I want a solution that finds an error based on the indicator "Error|Warning|Message" and the tag "@Whitelist". I hope it is clear what I mean.

nodakai · Accepted Answer

Update

One of the OP's requirements can be formulated as:

If an error item contains a line beginning with @Comment, anything following that line should be discarded
Otherwise, the output extends to the end of the input

I found it very difficult to fulfill this requirement with a single regex and ended up with three, one for each of these steps:

Distinguishing each error item in the input
Extracting items containing Error: etc. and @Whitelist (and discarding anything before Error: etc.)
Processing @Comment as stated above

http://ideone.com/E58Ihe

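import re

# input_str is assumed to hold the full report text.

# 1. Split the report into items at the newline(s) before each "<number>)" line.
#    \n+ (rather than \n*) keeps the pattern from matching the empty string,
#    which re.split() also splits on since Python 3.7.
splitter = re.compile(r'\n+(?=\d+\)\n)')
# 2. Keep an item only if @Whitelist follows an Error/Warning/Message line;
#    the match starts at that line, dropping anything before it.
item_filter = re.compile(
    r'^(Error|Warning|Message):.*@Whitelist.*', re.DOTALL | re.MULTILINE)
# 3. Drop everything after the @Comment line, if there is one.
cleanup = re.compile(r'(^@Comment[^\n]*).*', re.DOTALL | re.MULTILINE)

for chunk in splitter.split(input_str):
    m = item_filter.search(chunk)
    if m:
        output = cleanup.sub(r'\1', m.group(0))
        print("Output begin")
        print(output)
        print("Output end\n")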

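To try the three steps on the sample report from the question, they can be wrapped in a function. This is only a sketch: the names extract_whitelisted and SAMPLE are made up for illustration, the "..." placeholder lines are left out of the sample, and the splitter uses \n+ so that it never matches an empty string.

```python
import re

# Sample report copied from the question (without the "..." placeholder lines).
SAMPLE = """\
45)
Error: some description
Signal: lorem ipsum
@Whitelist

46)
Error: other description
Signal: lorem ipsum
File:   some file

47)
Error: lorem ipsum
Project: X
@Whitelist
@Comment description why this is sent to the whitelist
"""

# Step 1: split at the newline(s) in front of each "<number>)" line.
SPLITTER = re.compile(r'\n+(?=\d+\)\n)')
# Step 2: match from the indicator line to the end of an item containing @Whitelist.
ITEM = re.compile(
    r'^(Error|Warning|Message):.*@Whitelist.*', re.DOTALL | re.MULTILINE)
# Step 3: keep the @Comment line and drop whatever follows it.
CLEANUP = re.compile(r'(^@Comment[^\n]*).*', re.DOTALL | re.MULTILINE)


def extract_whitelisted(report):
    """Return the whitelisted items of `report` as a list of strings."""
    results = []
    for chunk in SPLITTER.split(report):
        m = ITEM.search(chunk)
        if m:
            # Cut after the @Comment line if present, then trim trailing newlines.
            results.append(CLEANUP.sub(r'\1', m.group(0)).rstrip('\n'))
    return results


items = extract_whitelisted(SAMPLE)
for item in items:
    print(item)
    print('---')
```

On the sample this yields items 45 and 47 and skips 46. A headline line such as "File Read Errors: " placed directly above "46)" would presumably end up at the tail of item 45, since nothing in these patterns cuts an item short when it has no @Comment line.
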
Original answer

From the Python re documentation on the lookahead assertion (?=...), https://docs.python.org/2/library/re.html#regular-expression-syntax:

Isaac (?=Asimov) will match 'Isaac ' only if it’s followed by 'Asimov'.
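
A quick way to see the lookahead from that quote in action (a minimal sketch; the second sample string is only there for contrast and is not from the documentation):

```python
import re

# (?=Asimov) checks that 'Asimov' follows, but does not consume it.
pattern = re.compile(r'Isaac (?=Asimov)')

hit = pattern.search('Isaac Asimov')
miss = pattern.search('Isaac Newton')

print(repr(hit.group(0)))  # the match stops right before 'Asimov'
print(miss)                # no match object, because 'Asimov' does not follow
```

The matched text ends where the lookahead begins, which is why the regex in this answer can require @Comment or an empty line at the end of a chunk without including it in the result.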

import re

regex = re.compile(
    r'(Error|Warning|Message)[^)]*@Whitelist[^)]*(?=(@Comment|\n\n))')
for m in regex.finditer(input_str):
    print(m.group(0))

Output:

Error: some description
Signal: lorem ipsum
@Whitelist
Error: lorem ipsum
Project: X
@Whitelist

The idea is that each matching chunk should begin with one of Error, Warning, or Message, contain @Whitelist, and end just before either @Comment or an empty line (the ending itself is excluded from the match by the (?=...) lookahead).

Note that [^)]* is used so that a single match cannot span multiple chunks: according to your examples, each chunk begins with a number followed by ), and [^)]* cannot cross that character.
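
The effect of [^)]* can be checked by comparing it with an unrestricted [\s\S]* on a small input. This is a rough sketch: the sample strings and the variable names are invented for illustration, and the last case shows a limitation that follows from the same design, namely that a ) inside the error text between the indicator and @Whitelist stops the match as well.

```python
import re

text = (
    "1)\nError: first\nFile: a.txt\n\n"
    "2)\nWarning: second\n@Whitelist\n\n"
)

# With [^)]*, a match cannot run across the ")" that starts the next chunk.
bounded = re.compile(
    r'(Error|Warning|Message)[^)]*@Whitelist[^)]*(?=(@Comment|\n\n))')
# With [\s\S]*, the match starts at the first indicator and swallows chunk 1.
unbounded = re.compile(
    r'(Error|Warning|Message)[\s\S]*@Whitelist[\s\S]*(?=(@Comment|\n\n))')

bounded_hits = [m.group(0) for m in bounded.finditer(text)]
unbounded_hits = [m.group(0) for m in unbounded.finditer(text)]

# Limitation: a ")" between the indicator and @Whitelist blocks the match.
tricky = "3)\nError: failed (timeout)\n@Whitelist\n\n"
tricky_hits = [m.group(0) for m in bounded.finditer(tricky)]

print(bounded_hits)
print(unbounded_hits)
print(tricky_hits)
```

Only the bounded pattern returns the whitelisted warning on its own; the unbounded one starts at the non-whitelisted error of chunk 1, and the tricky input produces no match at all.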
