Reputation: 12530
I'm trying to comb through some logs. I'm looking for entries whose URL has the form http://something/something.php. I currently have this:
https?.*?\.php
The problem with this is that some of my logs have URLs with URLs in their parameters, like this:
http://hello/world.asp?redirect=http://something/else.php
http://hello/blah.asp?abc=/blah/blah.php
Some URLs have multiple parameters, and a URL can appear in any of them, not necessarily at the end of the line. All of those get matched as well. In the examples above, the actual target is a .asp page; it just happens to have a .php URL in one of its parameters.
What kind of regex could I use to match only when the actual target is a .php, as opposed to one of its parameters being a URL with a .php?
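For illustration, here is a minimal Python sketch of the problem, using only the pattern and the sample URLs from above. The variable names are just placeholders. Because the pattern is unanchored and `.*?` is free to cross the `?` that starts the query string, all three lines match:

```python
import re

# The pattern from the question: lazy, unanchored, and allowed to cross '?'.
pattern = re.compile(r"https?.*?\.php")

lines = [
    "http://something/something.php",
    "http://hello/world.asp?redirect=http://something/else.php",
    "http://hello/blah.asp?abc=/blah/blah.php",
]

# Collect every line the pattern finds a match in.
matched = [line for line in lines if pattern.search(line)]

for line in lines:
    print(line, "->", "match" if line in matched else "no match")
```

The two .asp lines match only because the search runs on into the query string until it finds ".php".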
Upvotes: 2
Views: 300
Reputation: 126762
Restricting yourself to a regex solution is rarely a good idea. Use the URI module to handle URL strings conveniently, like this:
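use strict;
use warnings 'all';
use URI;

while ( <DATA> ) {
    chomp;

    # Parse the line so the path and the query string are separated
    my $url = URI->new($_);

    # Match only http(s) URLs whose path, not query string, ends in .php
    # (the // '' avoids an "uninitialized" warning on lines with no scheme)
    my $ok = ( $url->scheme // '' ) =~ /\Ahttps?\z/ && $url->path =~ /\.php\z/;

    printf qq{URL "%s" %s\n}, $url, $ok ? "matches" : "doesn't match";
}
__DATA__
http://something/something.php
http://hello/world.asp?redirect=http://something/else.php
http://hello/blah.asp?abc=/blah/blah.php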
Output:
URL "http://something/something.php" matches
URL "http://hello/world.asp?redirect=http://something/else.php" doesn't match
URL "http://hello/blah.asp?abc=/blah/blah.php" doesn't match
Upvotes: 1
Reputation: 2341
Instead of matching any character in the URL, exclude '?' so the match cannot run into the query string, and anchor it at the start of the line (^):
^https?[^\?]*\.php
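As a hedged sketch of how this idea could be tightened, the Python snippet below uses a variant of the pattern rather than the exact one above. It also requires "://", stops at '#' and whitespace, and insists that ".php" is the end of the path, so something like index.php5 is not accepted. The name `targets_php` is hypothetical, and the sketch assumes each log line starts with the URL.

```python
import re

# Assumed refinement of the anchored pattern: the path may not contain
# '?', '#' or whitespace, and ".php" must be followed by the query string,
# a fragment, whitespace, or the end of the line.
TARGET_PHP = re.compile(r"^https?://[^?#\s]*\.php(?=[?#\s]|$)")


def targets_php(line: str) -> bool:
    """Return True when the URL at the start of the line has a path ending in .php."""
    return TARGET_PHP.search(line) is not None


samples = [
    "http://something/something.php",
    "http://hello/world.asp?redirect=http://something/else.php",
    "http://hello/blah.asp?abc=/blah/blah.php",
    "http://something/index.php?next=/a.asp",
    "http://something/index.php5",
]

for sample in samples:
    print(sample, "->", targets_php(sample))
```

If the URL can appear in the middle of a log line, the leading ^ would have to be replaced by whatever delimits the URL field in that log format.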
Upvotes: 0