Kapé

Reputation: 4801

Parse multiline log entries using a regex

I'm trying to parse log entries in a C# app using this regex: (^[0-9]{4}(-[0-9]{2}){2}([^|]+\|){3})(?!\1). The logs are in a format like [date (in some format)] | [level] | [appname] | [message].

Where (I think):

For example, I have the following 4 log entries (separated by blank lines for clarity):

2015-03-03 19:30:47.2725|INFO|MyApp|This is a single line log message.

2015-03-03 19:31:29.1209|INFO|MyApp|This log message has multiple
lines with
2015-03-03
a date in it.

2015-03-03 19:32:50.1106|INFO|MyApp|This log message has
multiple lines
but just text only.

2015-03-03 19:33:20.2683|ERROR|MyApp|This log message has multiple lines but
also some confusing text like
2015-03-03 19:33:20.2683|ERROR| which should
still be a valid log message.

But the regex does not capture the message when I test it on regex101, probably because I don't understand how to use the negative lookahead to capture it.

If I include .* in the regex: (^[0-9]{4}(-[0-9]{2}){2}([^|]+\|){3}).*(?!\1) it matches the message but only a single line (because . does not match a newline).

So how can I capture the (multiline) message?

Upvotes: 6

Views: 3564

Answers (3)

user557597

Something like this should work; see the comments in the formatted regex.
(Edit: the trailing line break is now optional, so a message at the end of the string or a single-line message still matches.)

 @"(?m)^[0-9]{4}(?:-[0-9]{2}){2}(?:[^|\r\n]+\|){3}((?:(?!^[0-9]{4}(?:-[0-9]{2}){2}(?:[^|\r\n]+\|){3}).*(?:\r?\n)?)+)"

Formatted:

 (?m)                          # Modifier - multiline
 ^                             # BOL
 [0-9]{4}                      # Message header
 (?: - [0-9]{2} ){2}
 (?: [^|\r\n]+ \| ){3}
 (                             # (1 start), The Message
      (?:
           (?!                           # Assert, not a Message header
                ^                             # BOL
                [0-9]{4} 
                (?: - [0-9]{2} ){2}
                (?: [^|\r\n]+ \| ){3}
           )
           .*                            # Line is ok, it's part of the message
           (?: \r? \n )?                 # Optional line break
      )+
 )                             # (1 end)

Output:

 **  Grp 0 -  ( pos 0 , len 74 ) 
2015-03-03 19:30:47.2725|INFO|MyApp|This is a single line log message.


 **  Grp 1 -  ( pos 36 , len 38 ) 
This is a single line log message.

--------------

 **  Grp 0 -  ( pos 74 , len 108 ) 
2015-03-03 19:31:29.1209|INFO|MyApp|This log message has multiple
lines with
2015-03-03
a date in it.


 **  Grp 1 -  ( pos 110 , len 72 ) 
This log message has multiple
lines with
2015-03-03
a date in it.

--------------

 **  Grp 0 -  ( pos 182 , len 97 ) 
2015-03-03 19:32:50.1106|INFO|MyApp|This log message has
multiple lines
but just text only.


 **  Grp 1 -  ( pos 218 , len 61 ) 
This log message has
multiple lines
but just text only.

--------------

 **  Grp 0 -  ( pos 279 , len 186 ) 
2015-03-03 19:33:20.2683|ERROR|MyApp|This log message has multiple lines but
also some confusing text like
2015-03-03 19:33:20.2683|ERROR| which should
still be a valid log message.

 **  Grp 1 -  ( pos 316 , len 149 ) 
This log message has multiple lines but
also some confusing text like
2015-03-03 19:33:20.2683|ERROR| which should
still be a valid log message.
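
To use the pattern from C#, something along these lines might work. This is only a sketch: the class and method names (LogEntryParser, ExtractMessages) are made up for illustration. Since group 1 also swallows the blank lines that separate entries (as the lengths in the output above show), the sketch trims trailing line breaks from each message.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class LogEntryParser
{
    // Same pattern as above, split over two literals for readability.
    // Group 1 is the message body following the three-field header.
    private static readonly Regex EntryRegex = new Regex(
        @"(?m)^[0-9]{4}(?:-[0-9]{2}){2}(?:[^|\r\n]+\|){3}" +
        @"((?:(?!^[0-9]{4}(?:-[0-9]{2}){2}(?:[^|\r\n]+\|){3}).*(?:\r?\n)?)+)");

    // Returns the message text of every entry, without trailing line breaks.
    public static List<string> ExtractMessages(string log)
    {
        var messages = new List<string>();
        foreach (Match m in EntryRegex.Matches(log))
        {
            messages.Add(m.Groups[1].Value.TrimEnd('\r', '\n'));
        }
        return messages;
    }
}
```

Calling LogEntryParser.ExtractMessages(File.ReadAllText(path)) should then give one string per log entry, with any embedded line breaks preserved.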

Upvotes: 4

anubhava

Reputation: 785471

You can use this regex:

(^\d{4}(-\d{2}){2}([^|]+\|){3})([\s\S]*?)\n*(?=^\d{4}.*?(?:[^|\n]+\|){3}|\z)

This regex should work in C# as well; just make sure to use the MULTILINE flag (RegexOptions.Multiline in .NET) so that ^ matches at the start of every line.
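
A minimal C# sketch of that, assuming the log uses \n (or \r\n) line endings. The names LazyLogParser and Parse are hypothetical. In this pattern group 1 is the header and group 4 is the message; groups 2 and 3 are only side effects of the repeated sub-patterns.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class LazyLogParser
{
    // The regex from this answer, compiled with the Multiline option
    // so that ^ also matches right after each line break.
    private static readonly Regex EntryRegex = new Regex(
        @"(^\d{4}(-\d{2}){2}([^|]+\|){3})([\s\S]*?)\n*(?=^\d{4}.*?(?:[^|\n]+\|){3}|\z)",
        RegexOptions.Multiline);

    // Returns (header, message) pairs, one per log entry.
    public static List<(string Header, string Message)> Parse(string log)
    {
        var entries = new List<(string Header, string Message)>();
        foreach (Match m in EntryRegex.Matches(log))
        {
            // With \r\n input a stray \r can remain at the end, so trim it.
            entries.Add((m.Groups[1].Value, m.Groups[4].Value.TrimEnd('\r')));
        }
        return entries;
    }
}
```

Because the lazy [\s\S]*? stops only where a full header (or the end of the string) follows, a bare date inside a message does not split the entry.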

Upvotes: 3

Necreaux

Reputation: 9786

What regex engine are you using? In Java, for example, there is a flag (Pattern.DOTALL) that makes "." match newline characters.

The following regex appears to do the trick:

/(([0-9]{4})(-[0-9]{2}){2}([^|]+\|){3})((.(?!\2))*)/sg

The modifications I made to your regex were mostly cleanup (your date capturing group was wrong). I then added a . and a * to the final capturing group, so that it repeats "any character not followed by \2". https://regex101.com/r/fU1vV1/2

The most important part is the use of the sg flags. g makes it return all matches instead of only the first. s makes . match newline characters, so the whole input is treated like a single line (otherwise . stops at the first line break and the negative lookahead never comes into play). None of this would be necessary if you could guarantee that each message fits on one line, since you could then just capture to the end of the line.
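
The flags above are written in regex101 notation. In .NET the closest equivalents are RegexOptions.Singleline for s and Regex.Matches (rather than Regex.Match) for g. A small sketch of the effect of s on the dot, with a made-up class and method name:

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class, not part of any library.
public static class DotNewlineDemo
{
    // Matches "first" plus whatever ".*" can consume after it.
    // Without Singleline the dot stops at \n; with it, the dot crosses lines.
    public static string MatchFromFirst(string input, bool singleline)
    {
        RegexOptions options = singleline ? RegexOptions.Singleline : RegexOptions.None;
        return Regex.Match(input, "first.*", options).Value;
    }
}
```

For the input "first\nsecond", the call with singleline set to false returns only "first", while the call with true returns the whole string.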

Upvotes: -2
