Letokteren

Reputation: 809

Regexp in Grok sometimes matches a value and sometimes does not

I have a Grok filter that captures messages and tags them if they meet a given criterion.

My problem is that this filter sometimes works during testing and sometimes does not. The regexp in question is the following:

^(?!(?:\d\d\d\d-\d\d-\d\d.\d\d:\d\d:\d\d)).*$

This pattern checks that the message does not begin with a timestamp in the given format. In other words: if the message does not begin with such a timestamp, it gets a tag.
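For anyone who wants to reproduce the check outside Grok, here is a minimal sketch in Python. It assumes that Python's `re` module treats this particular pattern the same way as the Oniguruma engine behind Grok, which should hold for a plain negative lookahead like this one. The helper name `should_tag` is made up for illustration.

```python
import re

# The pattern from the question, unchanged. The negative lookahead rejects
# any line that starts with "YYYY-MM-DD?HH:MM:SS" (any single separator char).
NOT_TIMESTAMP = re.compile(r"^(?!(?:\d\d\d\d-\d\d-\d\d.\d\d:\d\d:\d\d)).*$")


def should_tag(message: str) -> bool:
    """Return True when the message does NOT begin with the timestamp."""
    return NOT_TIMESTAMP.match(message) is not None


lines = [
    "test",
    "2016-09-23 18:26:49,714",
    "2016-09-23 18:26:40,244",
    "test",
]

# Keep only the lines that would receive the tag.
tagged = [line for line in lines if should_tag(line)]
print(tagged)
```

With clean input like this, only the two "test" lines are selected, which is the behaviour I expect from the filter.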

You can test it yourself with this online application: http://grokconstructor.appspot.com/do/match#result

For these test values, the regexp matches every message that meets the criterion, so the two lines containing "test" are highlighted in green:

test
2016-09-23 18:26:49,714
2016-09-23 18:26:40,244
test

However, it also matches the first date when the input is something like this:

2016-09-23 18:26:49,714
2016-09-23 18:26:40,244
test

What is the reason behind this behaviour, and how can I prevent it?

Upvotes: 0

Views: 489

Answers (3)

Letokteren

Reputation: 809

It turned out that some messages began with a BOM (byte order mark), which I could match with the following regexp in Grok:

^(?:\xEF\xBB\xBF).*$

I could keep this mark on the clipboard, but it looks like Stack Overflow strips it out, which is why my example didn't reproduce for everyone.
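To show why an invisible BOM flips the result, here is a small sketch in Python. It assumes the line has already been decoded from UTF-8, so the BOM shows up as the single character U+FEFF rather than the three bytes EF BB BF. The `should_tag` helper and its `strip_bom` flag are hypothetical names used only for this illustration, not anything Grok or Logstash provides.

```python
import re

# The pattern from the question, unchanged.
NOT_TIMESTAMP = re.compile(r"^(?!(?:\d\d\d\d-\d\d-\d\d.\d\d:\d\d:\d\d)).*$")

# U+FEFF is the byte order mark; in UTF-8 it is encoded as EF BB BF.
BOM = "\ufeff"


def should_tag(message: str, strip_bom: bool = False) -> bool:
    """Return True when the message does NOT begin with the timestamp."""
    if strip_bom and message.startswith(BOM):
        message = message[len(BOM):]
    return NOT_TIMESTAMP.match(message) is not None


# Looks like a timestamp line on screen, but the first character is the BOM.
first_line = BOM + "2016-09-23 18:26:49,714"

# The BOM is not a digit, so the lookahead sees no timestamp and the line is tagged.
print(should_tag(first_line))

# Once the BOM is removed, the line is recognised as starting with a timestamp.
print(should_tag(first_line, strip_bom=True))
```

The first call reports the line as "not a timestamp" and the second does not, which matches what I saw: only the very first line of the input was affected.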

Upvotes: 2

Laurel

Reputation: 6173

I think I know what causes the behavior in the online tester, although I'm not sure why it happens or exactly what pattern it follows. (I'm familiar with regex, but nothing else here. Feel free to shed some light on this in the comments, if you know more.)

To replicate, put the following as the lines where it says "Some log lines you want to match":

2016-09-23 18:26:49,714
2016-09-23 18:26:40,244
test

In the place where it says "The pattern that should match all logfile lines", put your regex:

^(?!(?:\d\d\d\d-\d\d-\d\d.\d\d:\d\d:\d\d)).*$

Don't mess with the checkboxes (I'm not sure what they do, but they should all be checked).

Hit Go! and you get this result, as revo mentioned in the comments:

works

To get the other result, set things up exactly the same way (if you just submitted the regex, it should still be set up), but add that same regex to the area that says "If you want to use logstash's multiline filter please specify the used pattern".

Hit Go! and you get this result:

fails

The simple way to avoid this is not to use Logstash's multiline filter. (At least that's what I would assume.)

Upvotes: 1

A_Elric

Reputation: 3568

Why not just match the timestamp in a sensible way? You can match multiple date formats like this:

date {
  match => [ "log_timestamp", "dd/MMM/YYYY HH:mm:ss", "dd/MMM/YYYY HH:mm:ss.SSS" ]
  timezone => "Etc/UTC"
  locale => "en-US"
}

This would match 23/SEP/2016 15:15:00 or 23/SEP/2016 15:15:00.123 (we made a change when we were versioning).

As long as the timestamp doesn't appear elsewhere in the line, this should pretty much cover you.
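The same "try several formats in order" idea can be sketched in Python for quick experiments outside Logstash. This is only a rough analogue, not what the date filter does internally: the `strptime` directives are my own translation of the two formats above, the function name `parse_log_timestamp` is hypothetical, and month-name parsing assumes the default English (C) locale.

```python
from datetime import datetime, timezone
from typing import Optional

# Assumed strptime equivalents of "dd/MMM/YYYY HH:mm:ss" and
# "dd/MMM/YYYY HH:mm:ss.SSS"; %f accepts one to six fractional digits.
FORMATS = ("%d/%b/%Y %H:%M:%S", "%d/%b/%Y %H:%M:%S.%f")


def parse_log_timestamp(value: str) -> Optional[datetime]:
    """Try each format in turn; return a UTC datetime or None if none fit."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(value, fmt).replace(tzinfo=timezone.utc)
        except ValueError:
            continue  # this format did not fit, try the next one
    return None


print(parse_log_timestamp("23/Sep/2016 15:15:00"))
print(parse_log_timestamp("23/Sep/2016 15:15:00.123"))
print(parse_log_timestamp("test"))
```

A value that fits neither format comes back as `None`, which could then drive the tagging decision instead of a negative lookahead.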

Upvotes: 2
