stevendesu

Reputation: 16831

Determining if a URI is valid using Perl regex

For an application I'm developing I need a Perl script which loops through a massive CSV file and ensures that every single line contains a valid URI. I already asked a question earlier about parsing a CSV file and I have started using Text::CSV to make my life a lot easier. Now I have the issue of ensuring that the URI is valid.

Due to the nature of my application, URIs do not need to take the full form of

protocol://username:password@domain.com/request?vars=values

Rather I am only interested in the request portion of this. For a general website, that would be anything after the .com, .edu, etc.

I currently have the following Perl script:

if($_ !~ /^(?:[a-z0-9-._~!$&'()*+,;=:/?@]|%[0-9A-F]{2})*$/i){
    print "Invalid URL format";
    exit;
} else {
    # stuff
}

The regex should be fairly straight-forward. The request is allowed to contain either one of a small set of symbols ([a-z0-9-._~!$&'()*+,;=:/?@]) or it may contain a percent sign (%) followed by two hexadecimal digits. Either of these patterns may be repeated indefinitely.

When I run this script I get the following error:

Number found where operator expected at ./301rules.pl line 58, near "%[0"
        (Missing operator before 0?)
Bareword found where operator expected at ./301rules.pl line 58, near "9A"
        (Missing operator before A?)
Bareword found where operator expected at ./301rules.pl line 58, near "$/i"
        (Missing operator before i?)
syntax error at ./301rules.pl line 58, near "%[0"

It's fairly obvious that something in my regex needs to be escaped, but I'm unsure what. I tried escaping every possible symbol to create the following regex:

if($_ !~ /^(?:[a-z0-9\-\.\_\~\!\$\&\'\(\)\*\+\,\;\=\:\/\?\@]|%[0-9A-F]{2})*$/i){

However, when I did this it just allowed every string to pass the test, even strings which I know are invalid, such as te%st or é

So does anyone have experience with Perl regex and know what I need to escape and what I should not escape? With 19 different symbols I don't feel like trying all 2^19 = 524288 possibilities.

EDIT - voting to close. I found out that the issue actually existed immediately above this loop, although I don't entirely understand why yet.

I had:

if( $_ == "" ){
    next;
}
# regex conditional from above

For whatever reason it kept evaluating to true and going to the next iteration despite there clearly being data stored in $_. I'll figure out why this was, but for now the regex works fine with everything escaped.
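A likely explanation, assuming $_ held a non-numeric string at that point: == compares its operands as numbers, and both "" and any string that does not start with a number convert to 0, so the test is true. The eq operator compares them as strings. A minimal sketch of the difference, where $line and its value are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical sample line; any non-numeric string behaves the same way.
my $line = "/some/request?vars=values";

# Numeric comparison: both operands are converted to numbers first.
# A non-numeric string becomes 0 and so does "", so this is true.
# (The 'numeric' warning is silenced only to keep the output clean.)
my $numeric_equal = do { no warnings 'numeric'; $line == "" };

# String comparison: compares the characters themselves, so this is false.
my $string_equal = $line eq "";

print "== says empty: ", ($numeric_equal ? "yes" : "no"), "\n";
print "eq says empty: ", ($string_equal  ? "yes" : "no"), "\n";
```

With use warnings enabled, the == version would also have reported "Argument ... isn't numeric", which is usually the quickest hint that eq was intended.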

Upvotes: 3

Views: 2165

Answers (3)

DavidO

Reputation: 13942

In the documentation for the URI module I found the following:

PARSING URIs WITH REGEXP

As an alternative to this module, the following (official) regular expression can be used to decode a URI:

    my($scheme, $authority, $path,
    $query, $fragment) =   $uri =~
    m|(?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?|;

The URI::Split module provides the function uri_split() as a readable alternative.
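Since the question only cares about the request portion, here is a small sketch of how that regular expression could be used to pull out the path and query. The sample URI is made up for illustration, and the pattern is copied unchanged from the documentation excerpt above:

```perl
use strict;
use warnings;

# Hypothetical input, chosen only to exercise every capture group.
my $uri = 'http://www.example.com/request?vars=values#top';

# Same pattern as quoted from the URI documentation; each pair of
# capturing parentheses yields one component of the URI.
my ($scheme, $authority, $path, $query, $fragment) =
    $uri =~ m|(?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?|;

# Rebuild the "request" part: the path plus the query, if there is one.
my $request = $path . (defined $query ? "?$query" : "");

print "scheme:    $scheme\n";
print "authority: $authority\n";
print "request:   $request\n";
```

Note that this pattern only splits a string into components; it matches almost any input, so by itself it does not validate anything.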

But I think Regexp::Common::URI is probably an ideal solution for syntax validation of an HTTP URI.

use Regexp::Common qw /URI/;
while (<>) {
    /$RE{URI}{HTTP}/  and  print "Contains an HTTP URI.\n";
}

Anything written by Damian and maintained by Abigail has got to be either inspired, great, crazy, or all of the above. (And I mean that with the highest possible regard).

Upvotes: 5

Dim_K

Reputation: 571

You should use the RFC regexp to check EVERY possible character. Look at this

Upvotes: -1

Tudor Constantin

Reputation: 26861

I don't know how you arrived at your first regex, but I'll try to help you fix it. You only have to escape the characters that have a special meaning in a regex. In your regex, they are: - . $ ( ) * / so the regex should look like:

if($_ !~ /^(?:[a-z0-9\-\._~!\$&'\(\)\*+,;=:\/?@]|%[0-9A-F]{2})*$/i){

I don't know exactly what ?: is trying to achieve there, but the first character class right after it (the expression between the first []) has no quantifier; maybe it should be followed by a *, a +, or a ?. Also, I think the | sign is meant to act as an "or" between your first character class and the second character class preceded by a %; as it looks right now, it applies between the first character class and the % sign only. It should probably be |(%[0-9A-F]{2}))*$
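One way to check how the pattern actually groups is to run it against a few sample strings. The sketch below is only an illustration: it uses the same pattern as above, written with qr{...} delimiters so the / needs no escaping, and the name $valid_request and the test strings are made up.

```perl
use strict;
use warnings;

# Same pattern as above, with {} delimiters so "/" needs no backslash.
# (?: ... ) is a non-capturing group; the trailing * repeats the whole
# alternation, i.e. "one allowed character OR one %XX escape", any number of times.
# \$ and \@ are escaped so Perl does not try to interpolate a variable.
my $valid_request = qr{^(?:[a-z0-9\-._~!\$&'()*+,;=:/?\@]|%[0-9A-F]{2})*$}i;

# Hypothetical inputs: the first two should pass, the last two should fail
# (a bare % that is not followed by two hex digits, and a non-ASCII character).
for my $candidate ('/path/to/page?vars=values', '/a%20b', 'te%st', "caf\x{e9}") {
    my $verdict = $candidate =~ $valid_request ? 'valid' : 'invalid';
    print "$verdict: $candidate\n";
}
```

When reading lines from a file, chomp them first: $ also matches just before a trailing newline, so an unchomped line would still pass.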

Upvotes: 2
