DalNM

Reputation: 291

A regex pattern for different Tomcat log entry formats

I’m a newbie in regex.

If I have the following line from Tomcat’s access log file:

123.45.67.89 - - [27/Oct/2000:09:27:09 -0400] "GET /java/javaResources.html HTTP/1.0" 200 10450 "-" "Mozilla/4.6 [en] (X11; U; OpenBSD 2.8 i386; Nav)"

The following pattern works fine with entries that look exactly like the one above:

"^([\\d.]+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] \"(.+?)\" (\\d{3}) (\\d+) \"([^\"]+)\" \"([^\"]+)\""

But not all log entries look exactly like the one above: sometimes an entry contains 9 fields, sometimes 7. An example of a 9-field entry:

82.132.139.79 - - [14/Jul/2011:18:52:44 +0100] "GET /~roger/cpp/introans.htm HTTP/1.1" 200 11195 "http://www.dcs.bbk.ac.uk/~roger/cpp/intro3.htm" "Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_2_1 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8C148 Safari/6533.18.5"

However, I’m only interested in the IP, the date and time, and the URL. Is there a pattern that extracts these from every entry regardless of how many fields it has?

Upvotes: 4

Views: 3306

Answers (1)

bd808

Reputation: 1803

The line you give in the example is in the pseudo-standard combined log format. This 9-field format extends the widely used common log format with two additional fields: referrer and user-agent.

By making the final two fields optional in your regex, you can match lines in either common or combined format:

"^(\\S+) (\\S+) (\\S+) \\[(.*?)\\] \"(.*?)\" (\\S+) (\\S+)( \"(.*?)\" \"(.*?)\")?"

The capture groups are:

  1. remote host
  2. RFC 1413 identity
  3. userid
  4. datetime
  5. request
  6. status
  7. bytes
  8. optional combined fields
  9. referrer
  10. user-agent

This pattern is deliberately loose about the contents of each field. Generally, when parsing a log you want to extract whatever you can rather than validate each line against a specification.
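To pull out just the IP, the date and time, and the URL, something along these lines should work. It is a minimal sketch: the class name `AccessLogParser` and the method `parse` are made up for illustration, and it assumes the request field has the usual `METHOD URL PROTOCOL` shape, so the URL is taken as the second space-separated token of group 5.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of Tomcat or any library.
public class AccessLogParser {
    // Same pattern as above; the trailing group makes referrer and user-agent optional.
    private static final Pattern LOG_LINE = Pattern.compile(
        "^(\\S+) (\\S+) (\\S+) \\[(.*?)\\] \"(.*?)\" (\\S+) (\\S+)( \"(.*?)\" \"(.*?)\")?");

    /** Returns {ip, datetime, url}, or null if the line does not match. */
    static String[] parse(String line) {
        Matcher m = LOG_LINE.matcher(line);
        if (!m.find()) {
            return null;
        }
        String request = m.group(5);
        // Assumption: request looks like "GET /path HTTP/1.1".
        // If it does not, fall back to the whole request string.
        String[] parts = request.split(" ");
        String url = parts.length >= 2 ? parts[1] : request;
        return new String[] { m.group(1), m.group(4), url };
    }

    public static void main(String[] args) {
        String common = "123.45.67.89 - - [27/Oct/2000:09:27:09 -0400] "
            + "\"GET /java/javaResources.html HTTP/1.0\" 200 10450";
        String combined = common + " \"-\" \"Mozilla/4.6 [en] (X11; U; OpenBSD 2.8 i386; Nav)\"";
        for (String line : new String[] { common, combined }) {
            String[] fields = parse(line);
            System.out.println(fields[0] + " | " + fields[1] + " | " + fields[2]);
        }
    }
}
```

Both the 7-field and the 9-field line yield the same three values here, because groups 1, 4 and 5 sit before the optional part of the pattern.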

Upvotes: 7
