Lucian Tarna

Reputation: 1827

Custom Grok regular expression matcher

I am trying to write a regular expression to parse my log file. The lines look like this:

I, [2018-03-23T13:30:10.076546 #3107]  INFO -- : method='HEAD' path='/healthcheck' format='*/*' ip= status=200 duration=0.03
I, [2018-03-23T13:31:23.488928 #3107]  INFO -- : method='GET' path='/feed/bc822bc19.csv' format= ip='127.0.0.0' status=200 duration=0.04 host='feeds' user='-' params={} agent='' protocol='http'
I, [2018-03-23T13:31:30.956484 #3107]  INFO -- : method='GET' path='/feed/ad4d93bee.csv' format= ip='127.0.0.0' status=200 duration=0.05 host='feeds' user='-' params={} agent='' protocol='http'
I, [2018-03-23T13:32:10.123399 #3107]  INFO -- : method='HEAD' path='/healthcheck' format='*/*' ip= status=200 duration=0.03 host='feeds' user='-' params={} agent='' protocol='http'
I, [2018-03-23T13:33:46.362908 #3107]  INFO -- : method='GET' path='/feed/e9cbe2f42e0a6.xml' format= ip='127.0.0.0' status=200 duration=0.02 host='feeds' user='-' params={} agent='' protocol='http'
I, [2018-03-23T13:34:10.060682 #3107]  INFO -- : method='HEAD' path='/healthcheck' format='*/*' ip= status=200 duration=0.03 host='feeds' user='-' params={} agent='' protocol='http'
I, [2018-03-23T13:35:01.445029 #3107]  INFO -- : method='GET' path='/feed/85b91d6f7.xml' format= ip='127.0.0.0' status=200 duration=0.02 host='feeds' user='-' params={} agent='' protocol='http'
I, [2018-03-23T13:35:04.486874 #3107]  INFO -- : method='GET' path='/feed/34bda5b6f.csv' format= ip='127.0.0.0' status=200 duration=0.33 host='feeds' user='-' params={} agent='' protocol='http'
I, [2018-03-23T13:35:04.609879 #3107]  INFO -- : method='GET' path='/feed/0b4dbb477.xml' format= ip='127.0.0.0' status=200 duration=0.00 host='feeds' user='-' params={} agent='' protocol='http'
I, [2018-03-23T13:35:07.441873 #3107]  INFO -- : method='GET' path='/feed/4b494e658.xml' format= ip='127.0.0.0' status=200 duration=0.00 host='feeds' user='-' params={} agent='' protocol='http'
I, [2018-03-23T13:35:34.640805 #3107]  INFO -- : method='GET' path='/feed/dbde9d8c5.xml' format= ip='127.0.0.0' status=200 duration=0.02 host='feeds' user='-' params={} agent='' protocol='http'
I, [2018-03-23T13:36:09.232026 #3107]  INFO -- : method='HEAD' path='/healthcheck' format='*/*' ip= status=200 duration=0.03 host='feeds' user='-' params={} agent='' protocol='http'
I, [2018-03-23T13:36:11.494500 #3107]  INFO -- : method='GET' path='/feed/d42267d54.xml' format= ip='127.0.0.0' status=200 duration=0.00 host='feeds' user='-' params={} agent='' protocol='http'
I, [2018-03-23T13:38:09.878287 #3107]  INFO -- : method='HEAD' path='/healthcheck' format='*/*' ip= status=200 duration=0.01 host='feeds' user='-' params={} agent='' protocol='http'
I, [2018-03-23T13:38:32.595255 #3107]  INFO -- : method='GET' path='/feed/4b9badc64.csv' format= ip='127.0.0.0' status=200 duration=0.00 host='feeds' user='-' params={} agent='' protocol='http'
I, [2018-03-23T13:38:34.941950 #3107]  INFO -- : method='GET' path='/feed/212ddc50f.csv' format= ip='127.0.0.0' status=200 duration=0.00 host='feeds' user='-' params={} agent='' protocol='http'
I, [2018-03-23T13:38:36.658162 #3107]  INFO -- : method='GET' path='/feed/34bcd9d0e.csv' format= ip='127.0.0.0' status=200 duration=0.00 host='feeds' user='-' params={} agent='' protocol='http'
I, [2018-03-23T13:38:38.223703 #3107]  INFO -- : method='GET' path='/feed/fe286b188.csv' format= ip='127.0.0.0' status=200 duration=0.00 host='feeds' user='-' params={} agent='' protocol='http'
I, [2018-03-23T13:56:29.026273 #3107]  INFO -- : method='GET' path='/feed/c1684e144.csv' format='text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' ip='127.0.0.0' status=200 duration=0.49 host='feeds' user='-' params={} agent='Mozilla/5.0 (X11; Linux x86_64; rv:29.0) Gecko/20100101 Firefox/29.0' protocol='http'

I am trying to parse each line to get the following fields:

timestamp, method, path, format, ip, status, duration, host, user, params, agent and protocol.

I have almost no knowledge of regular expressions, so this task is quite hard. I have been trying to write something, but I didn't manage to get it right at all.

This is my attempt:

"no-clue-what-to-write + method=%{WORD:message_method}[]+path=%{WORD:message_path}[]+format=%{WORD:message_format}[]+ip=%{WORD:message_ip}[]+status=%{BASE10NUM:message_status_integer}[ ]+duration=%{BASE10NUM:message_duration_float}[ ]+host=%{WORD:message_host}[]+.*user=%{USERDASH:message_user}[ ]+ip=%{IP:message_ip}[ ]+params=%{WORD:message_params}[]+agent=%{WORD:message_agent}[]+protocol=%{WORD:message_protocol}[]+"

How could I write this so that it actually works?

I am trying to test it here: http://grokconstructor.appspot.com/do/match. Is this even ok?

Upvotes: 1

Views: 55

Answers (1)

Sufiyan Ghori

Reputation: 18743

Your timestamp is in ISO 8601 format, which can be matched with the predefined grok pattern %{TIMESTAMP_ISO8601}.
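As a quick sanity check outside of grok, the timestamp from the first sample line does parse as ISO 8601. This is only a minimal sketch in Python (an assumption for illustration, not part of a Logstash setup):

```python
from datetime import datetime

# Timestamp copied from the first sample log line.
raw = "2018-03-23T13:30:10.076546"

# fromisoformat accepts the "YYYY-MM-DDTHH:MM:SS.ffffff" layout used in the log.
ts = datetime.fromisoformat(raw)
print(ts.year, ts.month, ts.day, ts.microsecond)
```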

I matched the rest of the fields with predefined patterns such as WORD, DATA and IP. Some fields are blank: format= and ip= sometimes appear with no quoted value at all, so I wrapped those values, together with their quotes, in an optional group using the ? operator, which means "zero or one occurrence of the previous token". The first sample line also stops after duration, so the trailing fields (host through protocol) sit in an optional group as well.

This custom grok pattern should match each of the log lines you provided:

I, \[%{TIMESTAMP_ISO8601} %{DATA} method='%{WORD:method}' path='%{URIPATH:path}' format=(?:'%{DATA:format}')? ip=(?:'(?:%{IP:ip})?')? status=%{INT:status} duration=%{NUMBER:duration:float}(?: host=(?:'(?:%{WORD:host})?')? user=(?:'(?:%{USERNAME})?')? params=%{DATA:params} agent=(?:'%{DATA:agent}')? protocol='%{URIPROTO}')?
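
To see how optional groups cope with the blank fields in the sample lines, here is a rough Python equivalent. It is only a sketch: the character classes are hand-written approximations of grok's WORD, IP, DATA and so on (not the exact grok definitions), LOG_RE and parse are hypothetical names, and the quotes around format, ip and the trailing fields are treated as optional because some sample lines omit them.

```python
import re

# Approximations of the grok patterns (assumptions, not grok's own regexes):
# DATA -> .*?, WORD -> \w+, INT -> \d+, NUMBER -> \d+(\.\d+)?
LOG_RE = re.compile(
    r"I, \[(?P<timestamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?) .*?"
    r" method='(?P<method>\w+)'"
    r" path='(?P<path>/[^' ]*)'"
    r" format=(?:'(?P<format>.*?)')?"
    r" ip=(?:'(?P<ip>[0-9a-fA-F.:]*)')?"
    r" status=(?P<status>\d+)"
    r" duration=(?P<duration>\d+(?:\.\d+)?)"
    # The first sample line ends after duration, so the rest is optional.
    r"(?: host=(?:'(?P<host>\w*)')?"
    r" user=(?:'(?P<user>[\w.-]*)')?"
    r" params=(?P<params>.*?)"
    r" agent=(?:'(?P<agent>.*?)')?"
    r" protocol='(?P<protocol>[A-Za-z][A-Za-z0-9+.-]*)')?"
)


def parse(line):
    """Return a dict of named fields, or None if the line does not match."""
    m = LOG_RE.match(line)
    return m.groupdict() if m else None


# Two lines copied from the question: one short, one with every field.
short = ("I, [2018-03-23T13:30:10.076546 #3107]  INFO -- : method='HEAD' "
         "path='/healthcheck' format='*/*' ip= status=200 duration=0.03")
full = ("I, [2018-03-23T13:31:23.488928 #3107]  INFO -- : method='GET' "
        "path='/feed/bc822bc19.csv' format= ip='127.0.0.0' status=200 "
        "duration=0.04 host='feeds' user='-' params={} agent='' protocol='http'")

for line in (short, full):
    print(parse(line))
```

Fields that are missing or unquoted in a line come back as None, which is roughly how grok simply omits a capture that did not take part in the match.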

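Here is the output from an online grok debugger for the third sample line (the format and agent values in this run appear to be test placeholders rather than values from the posted log):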

{
  "TIMESTAMP_ISO8601": [
    [
      "2018-03-23T13:31:30.956484"
    ]
  ],
  "YEAR": [
    [
      "2018"
    ]
  ],
  "MONTHNUM": [
    [
      "03"
    ]
  ],
  "MONTHDAY": [
    [
      "23"
    ]
  ],
  "HOUR": [
    [
      "13",
      null
    ]
  ],
  "MINUTE": [
    [
      "31",
      null
    ]
  ],
  "SECOND": [
    [
      "30.956484"
    ]
  ],
  "ISO8601_TIMEZONE": [
    [
      null
    ]
  ],
  "DATA": [
    [
      "#3107]  INFO -- :"
    ]
  ],
  "method": [
    [
      "GET"
    ]
  ],
  "path": [
    [
      "/feed/ad4d93bee.csv"
    ]
  ],
  "format": [
    [
      "a"
    ]
  ],
  "ip": [
    [
      "127.0.0.0"
    ]
  ],
  "IPV6": [
    [
      null
    ]
  ],
  "IPV4": [
    [
      "127.0.0.0"
    ]
  ],
  "status": [
    [
      "200"
    ]
  ],
  "BASE10NUM": [
    [
      "0.05"
    ]
  ],
  "host": [
    [
      "feeds"
    ]
  ],
  "USERNAME": [
    [
      "-"
    ]
  ],
  "params": [
    [
      "{}"
    ]
  ],
  "agent": [
    [
      "saddas"
    ]
  ],
  "URIPROTO": [
    [
      "http"
    ]
  ]
}

Hope it helps.

Upvotes: 1
