Reputation: 291
I’m a newbie in regex.
If I have the following line from tomcat’s access log file:
123.45.67.89 - - [27/Oct/2000:09:27:09 -0400] "GET /java/javaResources.html HTTP/1.0" 200 10450 "-" "Mozilla/4.6 [en] (X11; U; OpenBSD 2.8 i386; Nav)"
The following pattern works fine with entries that look exactly like the one above:
"^([\\d.]+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] \"(.+?)\" (\\d{3}) (\\d+) \"([^\"]+)\" \"([^\"]+)\""
But not all log entries look exactly like the one above: some contain 9 fields, others only 7. An example of a 9-field entry:
82.132.139.79 - - [14/Jul/2011:18:52:44 +0100] "GET /~roger/cpp/introans.htm HTTP/1.1" 200 11195 "http://www.dcs.bbk.ac.uk/~roger/cpp/intro3.htm" "Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_2_1 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8C148 Safari/6533.18.5"
However, I’m only interested in the IP, the date and time, and the URL. Is there a pattern that matches log entries regardless of how many fields they contain?
Upvotes: 4
Views: 3306
Reputation: 1803
The line you give in the example is in the pseudo-standard combined log format. This 9-field format extends the widely used common log format (7 fields) with two additional fields: referrer and user-agent.
By making the final two fields optional in your regex you can match lines in either common or combined format:
"^(\\S+) (\\S+) (\\S+) \\[(.*?)\\] \"(.*?)\" (\\S+) (\\S+)( \"(.*?)\" \"(.*?)\")?"
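The capture groups are:
1. remote host (IP address)
2. remote logname (identd), often -
3. authenticated user, often -
4. date and time, including the UTC offset
5. request line (method, URL, protocol)
6. status code
7. response size in bytes
8. the optional referrer and user-agent portion as a whole
9. referrer (combined format only)
10. user-agent (combined format only)
This pattern is deliberately non-specific about the contents of each field. Generally, when parsing a log you want to extract whatever you can rather than validate each line against a specification.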
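As a rough sketch of how the pattern above might be used to pull out just the IP, the date and time, and the URL, something like the following should work. The class name AccessLogParser and the parse method are made up for illustration, and taking the URL as the second space-separated token of the request line is an assumption about how requests are logged. The second sample line is the question's first example with the last two fields trimmed off.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of Tomcat or any library.
final class AccessLogParser {
    // The pattern from the answer; the trailing referrer/user-agent group is optional.
    private static final Pattern LINE = Pattern.compile(
        "^(\\S+) (\\S+) (\\S+) \\[(.*?)\\] \"(.*?)\" (\\S+) (\\S+)( \"(.*?)\" \"(.*?)\")?");

    /** Returns {ip, timestamp, url}, or null if the line does not match. */
    static String[] parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.find()) {
            return null;
        }
        // Group 5 is the whole request line, e.g. "GET /index.html HTTP/1.0".
        // Assumption: the URL is the second space-separated token.
        String[] request = m.group(5).split(" ");
        String url = request.length > 1 ? request[1] : m.group(5);
        return new String[] { m.group(1), m.group(4), url };
    }

    public static void main(String[] args) {
        String combined = "123.45.67.89 - - [27/Oct/2000:09:27:09 -0400] "
            + "\"GET /java/javaResources.html HTTP/1.0\" 200 10450 "
            + "\"-\" \"Mozilla/4.6 [en] (X11; U; OpenBSD 2.8 i386; Nav)\"";
        // Same entry without referrer and user-agent (common log format).
        String common = "123.45.67.89 - - [27/Oct/2000:09:27:09 -0400] "
            + "\"GET /java/javaResources.html HTTP/1.0\" 200 10450";

        for (String line : new String[] { combined, common }) {
            String[] fields = parse(line);
            if (fields != null) {
                System.out.println(fields[0] + " | " + fields[1] + " | " + fields[2]);
            }
        }
    }
}
```

Because groups 8 to 10 are optional, they are simply null for 7-field lines, so the three fields of interest come out the same way for both formats.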
Upvotes: 7