Santrix

Reputation: 935

Regex match filename excluding specific extensions

I am trying to count accesses per minute from Apache logs that look like this:

domain.com:10.10.10.10 - - [26/Mar/2014:14:14:12 +0000] "GET /online_catalogue/files/flash/libs/framework_4.6.0.23201.swz HTTP/1.0" 200 327044 "http://www.domain.com/online_catalogue/files/flash/flippingbook.swf?key=foobar" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"

I am currently using this one-liner:

perl -ne '$a{$1}++ if /\[(.+?:[0-9]{2}:[0-9]{2})/; END { foreach $k(keys %a) { print "$k $a{$k}\n"; } }' logfile | sort

This works, but I want to exclude requests for static files such as swz, css, gif, png and jpg.

I tried altering the regex to

\[(.+?:[0-9]{2}:[0-9]{2}).+?(?:POST|GET) \/[^ ]+(?!\.swz|\.gif|\.css|\.jpg)

but this still matches those lines. I want to avoid matching them altogether.

Upvotes: 0

Views: 722

Answers (2)

Billy Moon

Reputation: 58521

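The [^ ]+ is greedy, so it consumes the whole filename, extension included. The negative look-ahead is then tested at the position after the filename, where the next character is a space, so it always succeeds and never rejects anything.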

Try adding another [^ ] after the negative look-ahead to prevent matches including the entire filename...

\[(.+?:[0-9]{2}:[0-9]{2}).+?(?:POST|GET) \/[^ ]+(?!\.swz|\.gif|\.css|\.jpg)[^ ]


Upvotes: 0

Ulugbek Umirov

Reputation: 12797

A small modification to your regex fixes the problem.

\[(.+?:[0-9]{2}:[0-9]{2}).+?(?:POST|GET) \/(?![^ ]+(\.swz|\.gif|\.css|\.jpg))[^ ]+

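The negative look-ahead now sits directly after the slash, before [^ ]+ has consumed anything, so it first checks that the path does not contain .swz, .gif, .css or .jpg. Only if that check passes does [^ ]+ go on to match the filename.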

Upvotes: 1
