Roke

Reputation: 23

Regular expression to get text between Indicator and Tag

My program generates a report of the errors that occurred, and I want to build a kind of whitelist from it. To do that, I want to parse all errors that carry the tag "@Whitelist".

The report looks like this:

...
45)
Error: some description
Signal: lorem ipsum
@Whitelist

46)
Error: other description
Signal: lorem ipsum
File:   some file

47)
Error: lorem ipsum
Project: X
@Whitelist
@Comment description why this is sent to the whitelist
...

Here I want errors 45 and 47, but not 46.

To sum up: I am looking for a regular expression that captures everything from the "Error" indicator (which can also be "Warning" or "Message"), inclusive, up to and including the "@Comment" line with its message, but only if "@Whitelist" is present.

There can be N lines between the Error indicator and @Whitelist.

I can't come up with a good solution for this problem myself. Can anyone help? Many thanks in advance.

Edit: I just realized that the report format may change over time; for example, a headline could be added above a group of errors. Say errors 46 and 47 have the same type, so the line "File Read Errors: " would appear above error 46. That's why I want a solution that picks out each error based on the "Error|Warning|Message" indicator and the "@Whitelist" tag. I hope it is clear what I mean.

Upvotes: 0

Views: 87

Answers (2)

nodakai

Reputation: 8033

Update

One of the OP's requirements can be formulated as:

  1. If an error item contains a line beginning with @Comment, anything following that line should be discarded
  2. Otherwise, the output extends to the end of the input

I found it very difficult to fulfil this requirement with a single regex and ended up with three, one for each step:

  1. Splitting the input into individual error items
  2. Extracting the items that contain Error: etc. and @Whitelist (and discarding anything before Error: etc.)
  3. Processing @Comment as stated above

http://ideone.com/E58Ihe

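import re

# input_str is assumed to hold the full report text.
# \n+ rather than \n*: the pattern must not match the empty string, otherwise
# Python 3.7+ would also split between the digits of a number such as "45)".
splitter = re.compile(r'\n+(?=\d+\)\n)')
# Named item_filter to avoid shadowing the built-in filter().
item_filter = re.compile(
    r'^(Error|Warning|Message):.*@Whitelist.*', re.DOTALL | re.MULTILINE)
# Keep the @Comment line itself, drop everything after it.
cleanup = re.compile(r'(^@Comment[^\n]*).*', re.DOTALL | re.MULTILINE)
for chunk in splitter.split(input_str):
    m = item_filter.search(chunk)
    if m:
        output = cleanup.sub(r'\1', m.group(0))
        print("Output begin")
        print(output)
        print("Output end\n")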
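
To try the three-step approach without an external input file, here is a minimal self-contained sketch. The sample text in REPORT and the function name extract_whitelisted are made up for illustration (modelled on the report in the question), and the splitter uses \n+ so that it never matches an empty string; treat it as a variant under those assumptions, not as the exact code behind the ideone link.

```python
import re

# Hypothetical sample input, modelled on the report in the question.
REPORT = """\
45)
Error: some description
Signal: lorem ipsum
@Whitelist

46)
Error: other description
Signal: lorem ipsum
File:   some file

47)
Error: lorem ipsum
Project: X
@Whitelist
@Comment description why this is sent to the whitelist
trailing line that should be dropped
"""

# Step 1: split before every "<number>)" line.
SPLITTER = re.compile(r'\n+(?=\d+\)\n)')
# Step 2: keep text from the indicator line onward, only if @Whitelist follows.
ITEM = re.compile(r'^(?:Error|Warning|Message):.*@Whitelist.*',
                  re.DOTALL | re.MULTILINE)
# Step 3: keep the @Comment line, drop whatever comes after it.
CLEANUP = re.compile(r'(^@Comment[^\n]*).*', re.DOTALL | re.MULTILINE)


def extract_whitelisted(report):
    """Return the whitelisted items of the report as a list of strings."""
    items = []
    for chunk in SPLITTER.split(report):
        m = ITEM.search(chunk)
        if m:
            items.append(CLEANUP.sub(r'\1', m.group(0)).rstrip('\n'))
    return items


for item in extract_whitelisted(REPORT):
    print(item)
    print('---')
```

Returning a list instead of printing inside the loop makes it easier to feed the items into whatever builds the actual whitelist.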

Original answer

A lookahead assertion, (?=...), helps here: Isaac (?=Asimov) will match 'Isaac ' only if it's followed by 'Asimov'.
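
A quick way to see the lookahead in isolation (a small sketch; the two test strings are arbitrary examples):

```python
import re

pattern = re.compile(r'Isaac (?=Asimov)')

# The lookahead is checked but not consumed, so only 'Isaac ' is matched.
m = pattern.search('Isaac Asimov')
print(repr(m.group(0)))

# No match at all when something other than 'Asimov' follows.
print(pattern.search('Isaac Newton'))
```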

import re

regex = re.compile(r'(Error|Warning|Message)[^)]*@Whitelist[^)]*(?=(@Comment|\n\n))')
for m in regex.finditer(input_str):
    print(m.group(0))

Output:

Error: some description
Signal: lorem ipsum
@Whitelist
Error: lorem ipsum
Project: X
@Whitelist

The idea is that each matching chunk should begin with one of Error, Warning, or Message, contain @Whitelist, and end just before either @Comment or an empty line (\n\n); the ending part itself is excluded from the match because it sits inside the (?=...) lookahead.

Note that [^)]* is used, rather than .*, so that a match cannot span multiple chunks: according to your examples, each chunk begins with a number followed by ), and [^)]* cannot cross that character.
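
The effect of [^)]* can be checked with a small experiment. This is only a sketch: the sample strings and the names naive and guarded are invented for the comparison, and the lookahead is left out to keep the focus on the character class. It also shows a limitation worth keeping in mind: a ) inside an error description stops the guarded pattern from matching that item.

```python
import re

# Hypothetical input: item 46 is not whitelisted, item 47 is.
text = (
    "46)\nError: other description\nFile: some file\n"
    "\n"
    "47)\nError: lorem ipsum\n@Whitelist\n\n"
)

# Without the guard, a lazy .*? happily runs from item 46 into item 47.
naive = re.compile(r'(?:Error|Warning|Message).*?@Whitelist', re.DOTALL)
# With [^)]*, the match cannot cross the ")" that starts the next item.
guarded = re.compile(r'(?:Error|Warning|Message)[^)]*@Whitelist')

naive_match = naive.search(text).group(0)
guarded_match = guarded.search(text).group(0)
print(naive_match.startswith("Error: other description"))
print(repr(guarded_match))

# Limitation: a ")" inside the description blocks the guarded pattern.
with_paren = "48)\nError: failed (code 3)\n@Whitelist\n\n"
print(guarded.search(with_paren))
```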

Upvotes: 1

tobias_k

Reputation: 82949

How about this non-regex solution: Just split by double-newline (i.e. empty lines) and see whether that block contains "@Whitelist":

for error in errors.split("\n\n"):
    if "\n@Whitelist" in error:
        print(error)
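
For reference, here is the same idea as a runnable sketch. The contents of errors are a made-up sample modelled on the report in the question, and the list name whitelisted is only illustrative.

```python
# Hypothetical sample report, modelled on the one in the question.
errors = (
    "45)\nError: some description\nSignal: lorem ipsum\n@Whitelist\n"
    "\n"
    "46)\nError: other description\nSignal: lorem ipsum\nFile:   some file\n"
    "\n"
    "47)\nError: lorem ipsum\nProject: X\n@Whitelist\n"
    "@Comment description why this is sent to the whitelist"
)

# Blocks are separated by an empty line; keep those with an @Whitelist line.
whitelisted = [block for block in errors.split("\n\n")
               if "\n@Whitelist" in block]

for block in whitelisted:
    print(block)
    print("---")
```

Note that each block still starts with its number line (45), 47)) and keeps anything that follows @Comment within the block, so some trimming may still be needed depending on what the whitelist should contain.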

Or, if there are actually no blank lines, try this:

for error in re.split("\n(?=Error|Warning|Message)", errors):
    ...

IMHO, the more complex your error log becomes, the less likely a single regex is going to help you. Instead, you could use one regex for splitting the error messages and another regex for checking them, but currently not even that seems to be needed.
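
Should it come to that, a two-regex version might look like the following sketch. The names SPLIT and CHECK and the sample text (no blank lines, with a headline on top) are assumptions made for illustration, not part of the original report format.

```python
import re

# Hypothetical report without blank lines and with a group headline.
errors = (
    "File Read Errors:\n"
    "Error: other description\nFile: some file\n"
    "Warning: lorem ipsum\nProject: X\n@Whitelist\n@Comment reason\n"
    "Message: plain\n"
)

# Regex 1: split in front of every line starting with an indicator.
SPLIT = re.compile(r"\n(?=(?:Error|Warning|Message):)")
# Regex 2: a block qualifies if one of its lines starts with @Whitelist.
CHECK = re.compile(r"^@Whitelist\b", re.MULTILINE)

whitelisted = [block for block in SPLIT.split(errors) if CHECK.search(block)]

for block in whitelisted:
    print(block)
```

A headline placed directly after a whitelisted block would end up attached to that block, so a changing report layout may still require extra handling.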

Upvotes: 1
