Mr Burtman

Reputation: 21

Parsing log files with regular expressions

I am trying to use regular expressions to split log file lines into the date and the remainder. Quite simple, I thought (which is good, because I know very little about regex).

The first line...

17      00000002  2011-05-02 22:39:14  StringID "Custom_Task IDS_ENUM_Task_262144_0" not found for locale []    

works fine with

.*00000002  (.*)  (.*)

(there are two spaces on either side of the date). This captures "2011-05-02 22:39:14" as group 1 and "StringID "Custom_Task IDS_ENUM_Task_262144_0" not found for locale []" as group 2.

But I ran into a problem with lines like the following;

17      00000002  2011-04-05 10:46:53  Warning: Server component Requirement.SSC failed to load.  Please ensure that the server is properly licensed.

Obviously, parsing group 1 as a date fails for this line, because the group now contains more than just the date.

Any suggestions? As I mentioned I am really not familiar with regex and it is probably staring me in the face :-)

All I need is the date-time as group 1 and the remainder of the line as group 2

And yes, I know I could just chop the line up from particular characters, but there are two reasons for this

  1. the files being read are huge & regex is much faster than left(substring(right(length-43 etc,etc :-)
  2. the length of the date could be determined by the locale settings the user has implemented - but I 'know' there will always be two spaces before and after the date section.

Upvotes: 2

Views: 4657

Answers (3)

Mathieu

Reputation: 101

So if I get this right, you want the date and what's after it?

What tool do you use your regex with? Sed? Perl?

Are the first two fields always similar? Seems you have more spaces now between the first two fields?

17 00000002 2011-04-05 10:46:53 Warning: Server component Requirement.SSC failed to load. Please ensure that the server is properly licensed.

With perl you could do a cat myfile | perl -pe 's/^(?:\S+\s+){2}(\S+\s\S+)\s+(.*)/$1 ## $2/'

Where:

(?:\S+\s+){2} means I want 2 times \S+\s+ which is non-space characters followed by space characters (?: means don't capture)

(\S+\s\S+) matches your date: non space characters followed by one space followed by more non space chars

\s+ some spaces

(.*) the rest
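For readers not using Perl, here is a minimal sketch of the same pattern in Python, applied to the second log line from the question. The variable names are only illustrative. Note that `(\S+\s\S+)` assumes the date-time contains exactly one space, so a locale format with a trailing AM/PM field would need a different group.

```python
import re

# Same pattern as the Perl one-liner: skip two whitespace-separated fields,
# capture "date time", then capture the rest of the line.
pattern = re.compile(r'^(?:\S+\s+){2}(\S+\s\S+)\s+(.*)')

line = ('17      00000002  2011-04-05 10:46:53  Warning: Server component '
        'Requirement.SSC failed to load.  Please ensure that the server is '
        'properly licensed.')

m = pattern.match(line)
if m:
    timestamp, message = m.group(1), m.group(2)
    print(timestamp)  # 2011-04-05 10:46:53
    print(message)
```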

This should work for lines like the ones you posted, but depending on what your data really looks like, we could make it better...

Upvotes: 0

Fredrik Pihl

Reputation: 45634

Something like this:

\d+\s+\d+\s+([0-9-]+)

or

00000002\s+([0-9-]+)

See it in action at rubular
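A quick Python check of the first pattern shows that `([0-9-]+)` stops at the first space, so it captures only the date part and not the time. The extended pattern below is an assumption of mine, not part of the answer above: it adds a `[0-9:]+` time field and a second group for the rest of the line.

```python
import re

line = ('17      00000002  2011-05-02 22:39:14  StringID '
        '"Custom_Task IDS_ENUM_Task_262144_0" not found for locale []')

# Pattern from the answer: group 1 is the date only.
date_only = re.search(r'\d+\s+\d+\s+([0-9-]+)', line)
print(date_only.group(1))  # 2011-05-02

# Hypothetical extension: also capture the time and the remainder.
extended = re.search(r'00000002\s+([0-9-]+ [0-9:]+)\s+(.*)', line)
print(extended.group(1))  # 2011-05-02 22:39:14
print(extended.group(2))
```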

Upvotes: 0

Lily Ballard

Reputation: 185663

Your problem is that the star quantifier (*) is "greedy", i.e. it matches as many characters as possible. You want to make it "non-greedy", so that it matches as few characters as possible instead. You can do this by putting a ? after the *, e.g.

00000002  (.*?)  (.*)

I also took the liberty of removing the leading .*, because regex searches are unanchored by default in most tools.
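To make the difference concrete, here is a small Python sketch that runs both versions against the problem line from the question. The greedy group backtracks only as far as the last double space in the line (the one after "load."), while the non-greedy group stops at the first one.

```python
import re

line = ('17      00000002  2011-04-05 10:46:53  Warning: Server component '
        'Requirement.SSC failed to load.  Please ensure that the server is '
        'properly licensed.')

greedy = re.search(r'00000002  (.*)  (.*)', line)
lazy = re.search(r'00000002  (.*?)  (.*)', line)

# Greedy: group 1 swallows the date plus the first sentence of the message.
print(greedy.group(1))

# Non-greedy: group 1 is just the date-time, group 2 is the whole message.
print(lazy.group(1))  # 2011-04-05 10:46:53
print(lazy.group(2))
```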

An alternative solution is to try and match the format of the date instead of using (.*?), so you no longer rely on the double spaces as a delimiter. Assuming all your dates look like YYYY-MM-DD HH:MM:SS you can do this with the following:

(\d{4}-\d\d-\d\d \d\d:\d\d:\d\d)\s+(.*)
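As a rough sketch of how this could be used over a large file, the pattern can be compiled once and applied line by line. The helper name `split_line` is hypothetical. Keep in mind that this version hard-codes the YYYY-MM-DD HH:MM:SS layout, so it will not match if the locale changes the date format.

```python
import re

# Compile once; reuse for every line of a large log file.
LOG_LINE = re.compile(r'(\d{4}-\d\d-\d\d \d\d:\d\d:\d\d)\s+(.*)')

def split_line(line):
    """Return (timestamp, remainder), or None if the line has no timestamp."""
    m = LOG_LINE.search(line)
    return (m.group(1), m.group(2)) if m else None

sample = ('17      00000002  2011-05-02 22:39:14  StringID '
          '"Custom_Task IDS_ENUM_Task_262144_0" not found for locale []')

print(split_line(sample))
print(split_line('a line without a timestamp'))  # None
```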

Upvotes: 2
