Reputation: 191
I am trying to write a regular expression to match the extended common log format. My current expression matches most log entries, but it fails when multiple hosts are listed.
Here is my current expression:
([^ ]*) ([^ ]*) ([^ ]*) \[([^]]*)\] "([^"]*)" ([^ ]*) ([^ ]*) "([^"]*)" "([^"]*)"
This successfully matches standard log entries. For example:
24.58.227.240 - - [22/Sep/2011:00:00:00 +0000] "GET /rss/merchant/airsoftpost.com HTTP/1.1" 200 1880 "-" "Apple-PubSub/65"
However, some of the log entries contain multiple host IPs separated by commas:
10.64.233.43, 69.171.229.245 - - [22/Sep/2011:00:00:00 +0000] "GET /view/thesanctuary.co.uk HTTP/1.1" 206 7289 "-" "facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)"
Could someone help me fix my expression so that it matches any number of hosts in a log entry?
Thanks.
Upvotes: 0
Views: 217
Reputation: 627469
As an option, try this regex pattern:
([\d.\s,]*) ([^ ]*) ([^ ]*) \[([^]]*)\] "([^"]*)" ([^ ]*) ([^ ]*) "([^"]*)" "([^"]*)"
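The first capturing group now matches any run of digits, periods, commas, and whitespace, so it captures a comma-separated list of IPv4 addresses of any length. Note that it will not match hostnames, since letters are not in the character class.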
See working demo.
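To check the pattern locally, here is a minimal Python sketch that applies it to the multi-host line from the question and splits the first group into individual hosts. The names `LOG_RE`, `line`, and `hosts` are just illustrative, and the question does not say which regex engine is in use, so Python's `re` module is an assumption.

```python
import re

# Pattern from the answer above; only the first group differs from the question's regex.
LOG_RE = re.compile(
    r'([\d.\s,]*) ([^ ]*) ([^ ]*) \[([^]]*)\] "([^"]*)" ([^ ]*) ([^ ]*) "([^"]*)" "([^"]*)"'
)

# Multi-host entry copied from the question.
line = (
    '10.64.233.43, 69.171.229.245 - - [22/Sep/2011:00:00:00 +0000] '
    '"GET /view/thesanctuary.co.uk HTTP/1.1" 206 7289 "-" '
    '"facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)"'
)

m = LOG_RE.match(line)
if m is None:
    raise ValueError("line did not match the log pattern")

# The first group holds the raw host list; split on commas and trim whitespace.
hosts = [h.strip() for h in m.group(1).split(',')]
status = m.group(6)

print(hosts)
print(status)
```

The greedy first group initially swallows the trailing space as well, then backtracks by one character so the literal space in the pattern can match.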
Upvotes: 1
Reputation: 31035
Following your regex, you can change:
([^ ]*) ([^ ]*) ([^ ]*) \[([^]]*)\] "([^"]*)" ([^ ]*) ([^ ]*) "([^"]*)" "([^"]*)"
To
([^-]*) ([^ ]*) ([^ ]*) \[([^]]*)\] "([^"]*)" ([^ ]*) ([^ ]*) "([^"]*)" "([^"]*)"
^--- here, match until first dash
The idea is to change only the first group:
([^ ]*) ---> matches until the first space (change this)
([^-]*) ---> matches until the first hyphen
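One thing to watch with `([^-]*)`: it stops at the first hyphen anywhere, so it works for the IP lists in the question but would presumably break if a host were a name containing a hyphen. The sketch below, in Python (an assumption, since the question does not name an engine), shows both cases. The hostname `my-proxy.example.com` and the alternative first group `((?:[^ ,]+, )*[^ ]+)` are hypothetical illustrations, not part of the original answer.

```python
import re

# Everything after the first group, unchanged from the question's regex.
REST = r' ([^ ]*) ([^ ]*) \[([^]]*)\] "([^"]*)" ([^ ]*) ([^ ]*) "([^"]*)" "([^"]*)"'

# First group as suggested in the answer above: match until the first hyphen.
DASH_RE = re.compile(r'([^-]*)' + REST)

# Hypothetical alternative: zero or more "host, " items followed by a final host.
LIST_RE = re.compile(r'((?:[^ ,]+, )*[^ ]+)' + REST)

# Shared tail of a log entry, taken from the question's multi-host example.
TAIL = (
    ' - - [22/Sep/2011:00:00:00 +0000] "GET /view/thesanctuary.co.uk HTTP/1.1" '
    '206 7289 "-" "facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)"'
)

ip_line = '10.64.233.43, 69.171.229.245' + TAIL
# Hypothetical entry whose first host is a hyphenated name.
name_line = 'my-proxy.example.com, 69.171.229.245' + TAIL

dash_ip = DASH_RE.match(ip_line)      # matches: no hyphen before " - - "
dash_name = DASH_RE.match(name_line)  # None: the group stops at "my"
list_ip = LIST_RE.match(ip_line)
list_name = LIST_RE.match(name_line)

print(dash_ip.group(1))
print(dash_name)
print(list_name.group(1))
```

If the logs only ever contain IP addresses in the host field, the `([^-]*)` version is sufficient and simpler.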
Upvotes: 1