Reputation: 21
I am trying to use regular expressions to split log file lines into the date and the remainder. Quite simple, I thought (which is good, because I know very little about regex).
The first line...
17 00000002 2011-05-02 22:39:14 StringID "Custom_Task IDS_ENUM_Task_262144_0" not found for locale []
works fine with
.*00000002 (.*) (.*)
(there are two spaces on each side of the date). This captures "2011-05-02 22:39:14" and "StringID "Custom_Task IDS_ENUM_Task_262144_0" not found for locale []".
But I ran into a problem with lines like the following:
17 00000002 2011-04-05 10:46:53 Warning: Server component Requirement.SSC failed to load. Please ensure that the server is properly licensed.
The two spaces before "Please" are causing it to make group 1 "2011-04-05 10:46:53 Warning: Server component Requirement.SSC failed to load." Obviously, if I try to parse that as a date, it fails.
Any suggestions? As I mentioned, I am really not familiar with regex, and the answer is probably staring me in the face :-)
All I need is the date-time as group 1 and the remainder of the line as group 2.
And yes, I know I could just chop the line up from particular characters, but there are two reasons for this
Upvotes: 2
Views: 4657
Reputation: 101
So if I get this right, you want the date and what's after it?
What tool do you use your regex with? Sed? Perl?
Are the first two fields always similar? It seems you now have more spaces between the first two fields.
17 00000002 2011-04-05 10:46:53 Warning: Server component Requirement.SSC failed to load. Please ensure that the server is properly licensed.
With Perl you could do:
cat myfile | perl -pe 's/^(?:\S+\s+){2}(\S+\s\S+)\s+(.*)/$1 ## $2/'
Where:
(?:\S+\s+){2}
matches \S+\s+ twice, that is, a run of non-space characters followed by a run of whitespace, two times over (the (?: prefix makes the group non-capturing)
(\S+\s\S+)
matches your date and time: non-space characters, followed by one space, followed by more non-space characters
\s+
some spaces
(.*)
the rest
This should work as long as the first two fields contain no whitespace, but depending on what your data really looks like, we could make it more specific...
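For anyone not using Perl, here is a rough Python sketch of the same pattern. The function name split_line and the double spaces in the sample line are my assumptions, not something taken from the real log files.

```python
import re

# Same pattern as the Perl one-liner: skip two whitespace-delimited fields,
# then capture "date time" as group 1 and the rest of the line as group 2.
LINE_RE = re.compile(r'^(?:\S+\s+){2}(\S+\s\S+)\s+(.*)')


def split_line(line):
    """Return (timestamp, message), or None if the line does not match.

    split_line is a hypothetical helper name used only for illustration.
    """
    match = LINE_RE.match(line)
    if match is None:
        return None
    return match.group(1), match.group(2)


# Sample line; the double spaces around the date are assumed from the question.
sample = ('17 00000002  2011-04-05 10:46:53  Warning: Server component '
          'Requirement.SSC failed to load.  Please ensure that the server '
          'is properly licensed.')

timestamp, message = split_line(sample)
print(timestamp)
print(message)
```

Because the pattern relies on \S+ and \s+ rather than a fixed number of spaces, it should not matter how many spaces separate the fields.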
Upvotes: 0
Reputation: 45634
Something like this:
\d+\s+\d+\s+([0-9-]+)
or
00000002\s+([0-9-]+)
See it in action at rubular
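One caveat: [0-9-]+ stops at the first space, so both patterns capture the date without the time. The sketch below shows what they capture in Python; the extended pattern at the end is my own guess at how to include the time and the rest of the line, not something tested against the real logs.

```python
import re

line = ('17 00000002 2011-05-02 22:39:14 StringID '
        '"Custom_Task IDS_ENUM_Task_262144_0" not found for locale []')

# The two patterns from above: group 1 is the date only.
date_a = re.search(r'\d+\s+\d+\s+([0-9-]+)', line).group(1)
date_b = re.search(r'00000002\s+([0-9-]+)', line).group(1)

# Hypothetical extension: also capture the time, then the rest as group 2.
extended = re.search(r'00000002\s+([0-9-]+\s[0-9:]+)\s+(.*)', line)
stamp, rest = extended.groups()

print(date_a, date_b)
print(stamp)
print(rest)
```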
Upvotes: 0
Reputation: 185663
Your problem is that the splat operator (*) is "greedy", i.e. it matches as many characters as possible. You want to make it "non-greedy", so that it matches as few characters as possible instead. You can do this by putting a ? after the *, e.g.
00000002 (.*?) (.*)
I also took the liberty of removing the leading .*, because regexes are unanchored by default.
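To make the difference concrete, here is a small Python sketch comparing the greedy and non-greedy versions on the problem line. I am assuming the date is surrounded by double spaces and that a double space also precedes "Please", as the question describes; re.search is unanchored, which is why no leading .* is needed.

```python
import re

# Sample line with the double spaces assumed from the question.
line = ('17 00000002  2011-04-05 10:46:53  Warning: Server component '
        'Requirement.SSC failed to load.  Please ensure that the server '
        'is properly licensed.')

# Greedy: group 1 runs up to the LAST double space in the line.
greedy = re.search(r'00000002  (.*)  (.*)', line)

# Non-greedy: group 1 stops at the FIRST double space after it starts.
lazy = re.search(r'00000002  (.*?)  (.*)', line)

print(repr(greedy.group(1)))
print(repr(lazy.group(1)))
print(repr(lazy.group(2)))
```

The greedy version swallows the warning text into group 1, while the non-greedy version leaves group 1 as just the date and time.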
An alternative solution is to match the format of the date itself instead of using (.*?), so you no longer rely on the double spaces as a delimiter. Assuming all your dates look like YYYY-MM-DD HH:MM:SS, you can do this with the following:
(\d{4}-\d\d-\d\d \d\d:\d\d:\d\d)\s+(.*)
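Since the goal is to parse group 1 as a date, here is a minimal Python sketch that applies this pattern and hands the result to datetime.strptime. The helper name parse_log_line is made up for illustration, and the format string assumes the YYYY-MM-DD HH:MM:SS layout mentioned above.

```python
import re
from datetime import datetime

# Match the timestamp by its shape, then take everything after it.
LOG_RE = re.compile(r'(\d{4}-\d\d-\d\d \d\d:\d\d:\d\d)\s+(.*)')


def parse_log_line(line):
    """Return (datetime, message), or None if no timestamp is found.

    parse_log_line is a hypothetical name. strptime raises ValueError if
    the digits match the shape but are not a real date (e.g. month 13).
    """
    match = LOG_RE.search(line)
    if match is None:
        return None
    stamp = datetime.strptime(match.group(1), '%Y-%m-%d %H:%M:%S')
    return stamp, match.group(2)


line = ('17 00000002 2011-05-02 22:39:14 StringID '
        '"Custom_Task IDS_ENUM_Task_262144_0" not found for locale []')
stamp, message = parse_log_line(line)
print(stamp.isoformat())
print(message)
```

This version does not care how many spaces surround the date, because \s+ absorbs whatever whitespace follows the timestamp.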
Upvotes: 2