Jin

Reputation: 1223

Python re: matching a space at the end of a string

The original regular expression given by the MOOC instructor is:

^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] "(\S+) (\S+)\s*(\S*)" (\d{3}) (\S+)

I think the catch is that the bad lines have an extra space after HTTP/1.0, just before the closing quote. Can anyone suggest a minor change to the pattern so that it parses BOTH kinds of line successfully? I tried changing (\S*) to (?:\s+|$) and to (\S.*), but neither worked.

A good line (parses correctly):

127.0.0.1 - - [01/Aug/1995:00:00:01 -0400] "GET /images/launch-logo.gif HTTP/1.0" 200 1839

A bad line (fails to parse, with the error message):

Invalid logline: ix-li1-14.ix.netcom.com - - [08/Aug/1995:14:47:41 -0400] "GET /shuttle/missions/sts-70/mission-sts-70.html HTTP/1.0 " 200 20304

Upvotes: 1

Views: 215

Answers (1)

lig

Reputation: 3890

Direct approach

^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] "(\S+) (\S+)\s*(\S*)\s?" (\d{3}) (\S+)

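Notice the \s? before the closing ". It allows one optional whitespace character between the protocol and the closing quote.

This matches both lines: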

127.0.0.1 - - [01/Aug/1995:00:00:01 -0400] "GET /images/launch-logo.gif HTTP/1.0" 200 1839

and

ix-li1-14.ix.netcom.com - - [08/Aug/1995:14:47:41 -0400] "GET /shuttle/missions/sts-70/mission-sts-70.html HTTP/1.0 " 200 20304
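To check this quickly, here is a minimal sketch using Python's standard re module with the two sample lines from the question. The names ORIGINAL, FIXED, and parse_line are made up for this illustration and are not part of the MOOC code.

```python
import re

# Original pattern from the question: no whitespace allowed before the closing quote.
ORIGINAL = re.compile(
    r'^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] '
    r'"(\S+) (\S+)\s*(\S*)" (\d{3}) (\S+)'
)

# Same pattern with \s? added before the closing quote.
FIXED = re.compile(
    r'^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] '
    r'"(\S+) (\S+)\s*(\S*)\s?" (\d{3}) (\S+)'
)

good = ('127.0.0.1 - - [01/Aug/1995:00:00:01 -0400] '
        '"GET /images/launch-logo.gif HTTP/1.0" 200 1839')
bad = ('ix-li1-14.ix.netcom.com - - [08/Aug/1995:14:47:41 -0400] '
       '"GET /shuttle/missions/sts-70/mission-sts-70.html HTTP/1.0 " 200 20304')


def parse_line(pattern, line):
    """Return the tuple of captured groups, or None if the line does not match."""
    match = pattern.match(line)
    return match.groups() if match else None


for name, line in (('good', good), ('bad', bad)):
    print(name, 'original:', parse_line(ORIGINAL, line) is not None)
    print(name, 'fixed:   ', parse_line(FIXED, line) is not None)
```

If some lines might contain more than one trailing space inside the quotes, \s* in place of \s? would be a slightly more tolerant variant.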

Upvotes: 1
