rhellem

Reputation: 869

Regex - Match the nth number of char, then stop (non-greedy)

Trying to capture the timestamp in this log event (for Splunk)

172.21.201.135 | http | o@1I0BTOx1063x3667295x0 | hkv | 2020-06-10 17:43:18,951 | "POST /rest/build-status/latest/commits/stats HTTP/1.1" | "http://bitbucket.my.com/projects/WF/repos/klp-libs/compare/commits" "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36" | 200 | 345 | 431 | - | 5 | 3dk4qm | 

With the TIME_PREFIX setting, Splunk software uses the specified regular expression to look for a match before attempting to extract a timestamp.

TIME_PREFIX = <regular expression>  

By default, Splunk would try to get the timestamp from the start of the line, but the line starts with an IP address. Hence the need for a regex that matches the first four pipes, which is the TIME_PREFIX.

By using the following regex

(?:[^\|]*(\|)){4}

I want the regex to match up to the fourth occurrence of '|' and then stop (non-greedy, I guess).

Upvotes: 1

Views: 579

Answers (1)

Wiktor Stribiżew

Reputation: 626709

There are two things to consider:

  • Anchor the pattern at the start of the string; otherwise, the environment may trigger a regex search at every position inside the string, and you may get many more matches than you expect.

  • When you do not need to create captures, i.e. when you needn't save part of the regex match to a separate memory buffer (in Splunk, this is equivalent to creating a separate field), you should use a non-capturing group rather than a capturing one when grouping a sequence of patterns.
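Both points can be illustrated with a small Python sketch. It uses Python's `re` module rather than Splunk's own regex engine, so it only approximates what Splunk does, and the shortened log line is a hypothetical stand-in for the real event:

```python
import re

# Hypothetical shortened log line (not from the question): nine pipes in total.
line = "a | b | c | d | 2020-06-10 17:43:18,951 | e | f | g | h | i"

unanchored = re.compile(r"(?:[^\|]*(\|)){4}")  # pattern from the question
anchored = re.compile(r"^(?:[^|]*\|){4}")      # anchored, no capturing group

# Collect every non-overlapping match each pattern finds in the line.
unanchored_matches = [m.group(0) for m in unanchored.finditer(line)]
anchored_matches = [m.group(0) for m in anchored.finditer(line)]

print(len(unanchored_matches))  # 2: a second run of four pipes also matches
print(len(anchored_matches))    # 1: only the run starting at position 0

# The capturing group stores the last pipe it matched; the anchored
# pattern has no capturing groups, so nothing is stored.
unanchored_groups = unanchored.search(line).groups()
anchored_groups = anchored.search(line).groups()
print(unanchored_groups)  # ('|',)
print(anchored_groups)    # ()
```
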

Thus, you need

^(?:[^|]*\|){4}\s*

See the regex demo showing the match extends to the datetime substring without matching it.

Details

  • ^ - start of string anchor
  • (?:[^|]*\|){4} - a non-capturing group ((?:...)) that matches four repetitions ({4}) of any 0 or more chars other than | ([^|]*) and then a | char (\|)
  • \s* - 0 or more whitespace characters.
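As a quick check outside Splunk, the pattern can be tried against the sample event with Python's `re` module. This is only a sketch: Splunk uses its own regex engine, and the event below is cut off after the request field to keep it short.

```python
import re

# Sample event from the question, cut off after the request field.
line = (
    '172.21.201.135 | http | o@1I0BTOx1063x3667295x0 | hkv | '
    '2020-06-10 17:43:18,951 | "POST /rest/build-status/latest/commits/stats HTTP/1.1" |'
)

time_prefix = re.compile(r"^(?:[^|]*\|){4}\s*")

m = time_prefix.match(line)
prefix = m.group(0)          # everything up to and including the 4th pipe and the space after it
remainder = line[m.end():]   # the text a timestamp extractor would look at next

print(repr(prefix))
print(remainder[:23])        # the first 23 characters are the timestamp
```

The match ends right before the datetime substring, so whatever follows the prefix starts with the timestamp.
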

Upvotes: 1
