Reputation: 1223
original re expression given by MOOC instructor is
^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] "(\S+) (\S+)\s*(\S*)" (\d{3}) (\S+)
I think the catch here is there is an extra space at the end of HTTP/1.0 for bad ones, anyone can hint to make a minor change to make it parse BOTH successfully? I tried to change (\S*) to (?:\s+|$) or (\S.*) and it did not work either way.
127.0.0.1 - - [01/Aug/1995:00:00:01 -0400] "GET /images/launch-logo.gif HTTP/1.0" 200 1839
Invalid logline: ix-li1-14.ix.netcom.com - - [08/Aug/1995:14:47:41 -0400] "GET /shuttle/missions/sts-70/mission-sts-70.html HTTP/1.0 " 200 20304
Upvotes: 1
Views: 215
Reputation: 3890
Direct approach
^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] "(\S+) (\S+)\s*(\S*)\s?" (\d{3}) (\S+)
Notice \s?
before second "
.
This matches both
127.0.0.1 - - [01/Aug/1995:00:00:01 -0400] "GET /images/launch-logo.gif HTTP/1.0" 200 1839
and
ix-li1-14.ix.netcom.com - - [08/Aug/1995:14:47:41 -0400] "GET /shuttle/missions/sts-70/mission-sts-70.html HTTP/1.0 " 200 20304
Upvotes: 1