mgeno

Reputation: 191

Regex Match Extended Common Log Format with Multiple Host

I am trying to write a regular expression to match the extended common log format. My current expression matches most log entries, but it fails when multiple hosts are listed.

Here is my current expression:

([^ ]*) ([^ ]*) ([^ ]*) \[([^]]*)\] "([^"]*)" ([^ ]*) ([^ ]*) "([^"]*)" "([^"]*)"

This successfully matches standard log entries. For example:

24.58.227.240 - - [22/Sep/2011:00:00:00 +0000] "GET /rss/merchant/airsoftpost.com HTTP/1.1" 200 1880 "-" "Apple-PubSub/65"

However, some of the log entries contain multiple host IPs separated by commas:

10.64.233.43, 69.171.229.245 - - [22/Sep/2011:00:00:00 +0000] "GET /view/thesanctuary.co.uk HTTP/1.1" 206 7289 "-" "facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)"
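The failure can be reproduced with a minimal sketch like the one below. It assumes Python's `re` module, since the question does not name a language, and `matches` is a hypothetical helper name. The first group `([^ ]*)` stops at the space after the comma, so the second IP is consumed as the second field and the pattern runs out of alignment before it reaches `\[`.

```python
import re

# Original pattern from the question, unchanged.
ORIGINAL = re.compile(
    r'([^ ]*) ([^ ]*) ([^ ]*) \[([^]]*)\] "([^"]*)" ([^ ]*) ([^ ]*) "([^"]*)" "([^"]*)"'
)

# The two sample log lines quoted in the question.
SINGLE = '24.58.227.240 - - [22/Sep/2011:00:00:00 +0000] "GET /rss/merchant/airsoftpost.com HTTP/1.1" 200 1880 "-" "Apple-PubSub/65"'
MULTI = '10.64.233.43, 69.171.229.245 - - [22/Sep/2011:00:00:00 +0000] "GET /view/thesanctuary.co.uk HTTP/1.1" 206 7289 "-" "facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)"'


def matches(line):
    """Return True if the original pattern matches from the start of the line."""
    return ORIGINAL.match(line) is not None
```

Here `matches(SINGLE)` is true and `matches(MULTI)` is false, which is the behaviour described above.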

Could someone help me fix my expression to match any number of hosts for a given log item?

Thanks.

Upvotes: 0

Views: 217

Answers (2)

Wiktor Stribiżew

Reputation: 627469

As an option, try this regex pattern:

([\d.\s,]*) ([^ ]*) ([^ ]*) \[([^]]*)\] "([^"]*)" ([^ ]*) ([^ ]*) "([^"]*)" "([^"]*)"

The first capturing group now matches any run of digits, periods, commas, and whitespace, so it captures the entire comma-separated host list.

See working demo.
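As a rough check of the pattern above, here is a small sketch that runs it against both sample lines from the question and splits the first group into individual hosts. Python is an assumption (the question does not say which regex engine is in use), and `split_hosts` is a hypothetical helper, not part of any library.

```python
import re

# Pattern from this answer; only the first group differs from the question's.
PATTERN = re.compile(
    r'([\d.\s,]*) ([^ ]*) ([^ ]*) \[([^]]*)\] "([^"]*)" ([^ ]*) ([^ ]*) "([^"]*)" "([^"]*)"'
)

# The two sample log lines quoted in the question.
SINGLE = '24.58.227.240 - - [22/Sep/2011:00:00:00 +0000] "GET /rss/merchant/airsoftpost.com HTTP/1.1" 200 1880 "-" "Apple-PubSub/65"'
MULTI = '10.64.233.43, 69.171.229.245 - - [22/Sep/2011:00:00:00 +0000] "GET /view/thesanctuary.co.uk HTTP/1.1" 206 7289 "-" "facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)"'


def split_hosts(line):
    """Return the list of hosts in the first field, or None if the line does not match."""
    m = PATTERN.match(line)
    if m is None:
        return None
    # Group 1 holds the raw host list, e.g. "10.64.233.43, 69.171.229.245".
    return [host.strip() for host in m.group(1).split(",")]
```

Note that the character class accepts only digits, periods, commas, and whitespace, so a line whose first field is a hostname or an IPv6 address would not match.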

Upvotes: 1

Federico Piazza

Reputation: 31035

Following your regex, you can change:

([^ ]*) ([^ ]*) ([^ ]*) \[([^]]*)\] "([^"]*)" ([^ ]*) ([^ ]*) "([^"]*)" "([^"]*)"

to:

([^-]*) ([^ ]*) ([^ ]*) \[([^]]*)\] "([^"]*)" ([^ ]*) ([^ ]*) "([^"]*)" "([^"]*)"
   ^--- here, match until first dash

The idea is to change only the first group:

([^ ]*)   ---> matches until the first space (change this)
([^-]*)   ---> matches until the first hyphen
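A quick sketch of how the modified pattern behaves, assuming Python's `re` module (the question does not name a language). `first_group` is a hypothetical helper, and the `HYPHEN_HOST` line is made up for illustration; it does not come from the question's log.

```python
import re

# Pattern from this answer: the first group stops at the first hyphen.
HYPHEN_PATTERN = re.compile(
    r'([^-]*) ([^ ]*) ([^ ]*) \[([^]]*)\] "([^"]*)" ([^ ]*) ([^ ]*) "([^"]*)" "([^"]*)"'
)

# Multi-host sample line quoted in the question.
MULTI = '10.64.233.43, 69.171.229.245 - - [22/Sep/2011:00:00:00 +0000] "GET /view/thesanctuary.co.uk HTTP/1.1" 206 7289 "-" "facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)"'

# Hypothetical line (an assumption, not from the log): the host itself contains a hyphen.
HYPHEN_HOST = 'proxy-1.example.com - - [22/Sep/2011:00:00:00 +0000] "GET / HTTP/1.1" 200 10 "-" "-"'


def first_group(line):
    """Return the captured host field, or None if the line does not match from its start."""
    m = HYPHEN_PATTERN.match(line)
    return m.group(1) if m else None
```

With the sample line, `first_group(MULTI)` gives the full host list `10.64.233.43, 69.171.229.245`. The trade-off is that the first field must not contain a hyphen: `first_group(HYPHEN_HOST)` returns `None`, so this variant fits logs where that field holds only IP addresses.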

Upvotes: 1
