Juicy

Reputation: 12530

Regex for URL that matches a format, but exclude parameters that match that URL format

I'm trying to comb through some logs. I'm looking for logs that have the format http://something/something.php. I currently have this:

https?.*?\.php

The problem with this is that some of my logs have URLs with URLs in their parameters, like this:

http://hello/world.asp?redirect=http://something/else.php
http://hello/blah.asp?abc=/blah/blah.php

Some logs contain multiple parameters, and a URL can appear in any of them, not necessarily at the end of the line. All of those get matched as well. In the examples above, the actual target is a .asp page that just happens to have a .php URL in one of its parameters.

What kind of regex could I use to match only when the actual target is a .php, as opposed to one of its parameters being a URL that ends in .php?

Upvotes: 2

Views: 300

Answers (2)

Borodin

Reputation: 126762

Restricting yourself to a regex-only solution is rarely a good idea.

The URI module handles URL strings conveniently and lets you test the path separately from the query string.

Like this:

use strict;
use warnings 'all';

use URI;

while ( <DATA> ) {

    chomp;

    my $url = URI->new($_);

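    # scheme is undef when the string has no scheme, so fall back to ''
    # Only the path is tested, so a .php inside the query string is ignored
    my $ok = ( $url->scheme // '' ) =~ /\Ahttps?\z/ && $url->path =~ /\.php\z/;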

    printf qq{URL "%s" %s\n}, $url, $ok ? "matches" : "doesn't match";
}

__DATA__
http://something/something.php
http://hello/world.asp?redirect=http://something/else.php
http://hello/blah.asp?abc=/blah/blah.php

Output:

URL "http://something/something.php" matches
URL "http://hello/world.asp?redirect=http://something/else.php" doesn't match
URL "http://hello/blah.asp?abc=/blah/blah.php" doesn't match

Upvotes: 1

Georg Mavridis

Reputation: 2341

Instead of matching any character in the URL, exclude '?' from the allowed characters and anchor the pattern at the start of the string (^):

^https?[^\?]*\.php
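To see how this behaves on the sample URLs from the question, here is a minimal Python sketch. It assumes each URL starts at the beginning of the string being tested, which the question's logs may not guarantee. `LOOSE` is the question's pattern and `ANCHORED` is the pattern above (without the redundant backslash inside the character class). `STRICT` is a hypothetical tightening that is not part of this answer: it additionally requires `.php` to be followed by `?`, `#`, or the end of the string, so a path such as `/a.php/b.asp` is rejected.

```python
import re

# Pattern from the question: lazy match from "http" to the first ".php".
LOOSE = re.compile(r"https?.*?\.php")

# Pattern from this answer: no "?" may appear before ".php".
ANCHORED = re.compile(r"^https?[^?]*\.php")

# Hypothetical stricter variant (an assumption, not from the answer):
# ".php" must be followed by "?", "#", or the end of the string.
STRICT = re.compile(r"^https?://[^?#\s]*\.php(?=[?#]|$)")

SAMPLES = [
    "http://something/something.php",
    "http://hello/world.asp?redirect=http://something/else.php",
    "http://hello/blah.asp?abc=/blah/blah.php",
]


def matches(pattern, url):
    """Return True if the pattern matches somewhere in url."""
    return pattern.search(url) is not None


for url in SAMPLES:
    # Print the verdict of each pattern next to the URL it was tested on.
    print(url, matches(LOOSE, url), matches(ANCHORED, url), matches(STRICT, url))
```

On these three samples, `LOOSE` accepts every line, while `ANCHORED` and `STRICT` accept only the first. If the URL can appear in the middle of a log line, the `^` anchor would have to be replaced by something that marks where the URL starts, which depends on the log format.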

Upvotes: 0
