Reputation: 16831
For an application I'm developing I need a Perl script which loops through a massive CSV file and ensures that every single line contains a valid URI. I already asked a question earlier about parsing a CSV file and I have started using Text::CSV to make my life a lot easier. Now I have the issue of ensuring that the URI is valid.
Due to the nature of my application, URIs do not need to take the full form of
protocol://username:password@example.com/request?vars=values
Rather, I am only interested in the request portion of this. For a general website, that would be anything after the .com, .edu, etc.
I currently have the following Perl script:
if($_ !~ /^(?:[a-z0-9-._~!$&'()*+,;=:/?@]|%[0-9A-F]{2})*$/i){
print "Invalid URL format";
exit;
} else {
/* stuff */
}
The regex should be fairly straightforward. The request is allowed to contain either one of a small set of symbols ([a-z0-9-._~!$&'()*+,;=:/?@]) or it may contain a percent sign (%) followed by two hexadecimal digits. Either of these patterns may be repeated indefinitely.
When I run this script I get the following error:
Number found where operator expected at ./301rules.pl line 58, near "%[0"
(Missing operator before 0?)
Bareword found where operator expected at ./301rules.pl line 58, near "9A"
(Missing operator before A?)
Bareword found where operator expected at ./301rules.pl line 58, near "$/i"
(Missing operator before i?)
syntax error at ./301rules.pl line 58, near "%[0"
It's fairly obvious that something in my regex needs to be escaped, however I'm unsure of what. I tried escaping every possible symbol to create the following regex:
if($_ !~ /^(?:[a-z0-9\-\.\_\~\!\$\&\'\(\)\*\+\,\;\=\:\/\?\@]|%[0-9A-F]{2})*$/i){
However, when I did this it just allowed every string to pass the test, even strings which I know are invalid, such as te%st or é.
So does anyone have experience with Perl regex and know what I need to escape and what I should not escape? With 19 different symbols I don't feel like trying all 2^19 = 524288 possibilities.
EDIT - voting to close. I found out that the issue actually existed immediately above this loop, although I don't entirely understand why yet.
I had:
if( $_ == "" ){
next;
}
/* regex conditional from above */
For whatever reason it kept evaluating to true and going to the next iteration despite there clearly being data stored in $_. I'll figure out why this was, but for now the regex works fine with everything escaped.
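A likely explanation for the check above: == is Perl's numeric comparison, so both sides are converted to numbers first, and any string that does not start with a digit (such as a request path) becomes 0, just like "". The string comparison operator is eq. The sketch below illustrates the difference; the sample lines and the two helper subs are made up for the illustration and are not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical helper: the comparison as written in the EDIT above.
# "==" numifies both operands, so "/some/request" and "" both become 0.
sub is_blank_numeric {
    my ($s) = @_;
    no warnings 'numeric';    # silence "Argument isn't numeric" for the demo
    return $s == "" ? 1 : 0;
}

# Hypothetical helper: the string comparison that was probably intended.
sub is_blank_string {
    my ($s) = @_;
    return $s eq "" ? 1 : 0;
}

# Made-up sample lines; only the second one is really empty.
for my $line ('/some/request', '', '/page?id=1') {
    printf "%-15s ==: %d  eq: %d\n", "'$line'",
        is_blank_numeric($line), is_blank_string($line);
}
```

With use warnings enabled (and without the no warnings line), Perl reports the non-numeric argument, which is usually the quickest way to spot this kind of mix-up.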
Upvotes: 3
Views: 2165
Reputation: 13942
In the documentation for the URI module I found the following:
PARSING URIs WITH REGEXP
As an alternative to this module, the following (official) regular expression can be used to decode a URI:
my($scheme, $authority, $path, $query, $fragment) = $uri =~ m|(?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?|;
The URI::Split module provides the function uri_split() as a readable alternative.
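As a rough sketch of how the quoted regular expression could be used to pull out just the request portion the question is interested in (path plus optional query string): the split_uri sub and the example URI below are my own illustration, not something taken from the URI documentation.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the regex quoted above.
# In list context the match returns the five captured components;
# components that are absent come back as undef.
sub split_uri {
    my ($uri) = @_;
    return $uri =~ m|(?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?|;
}

# Made-up example URI.
my ($scheme, $authority, $path, $query, $fragment) =
    split_uri('http://example.com/request?vars=values');

# The "request" portion: the path, plus the query string if there is one.
my $request = $path . (defined $query ? "?$query" : '');
print "$request\n";    # prints /request?vars=values
```

Note that this regex only splits a string into parts; it accepts almost anything, so it does not validate the characters inside each component.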
But I think Regexp::Common::URI is probably an ideal solution for syntax validation of an HTTP URI.
use Regexp::Common qw /URI/;
while (<>) {
/$RE{URI}{HTTP}/ and print "Contains an HTTP URI.\n";
}
Anything written by Damian and maintained by Abigail has got to be either inspired, great, crazy, or all of the above. (And I mean that with the highest possible regard).
Upvotes: 5
Reputation: 571
You should use the RFC regexp to check EVERY possible character. Look at this
Upvotes: -1
Reputation: 26861
I don't know how you got to your first regex, but I'll try to help you fix it. You only have to escape the characters that have special meaning in a regex. In your regex, they are: -, ., $, (, ), *, /, so the regex should look like:
if($_ !~ /^(?:[a-z0-9\-\._~!\$&'\(\)\*+,;=:\/?@]|%[0-9A-F]{2})*$/i){
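Here is a minimal sketch of the same check wrapped in a sub and run against a few strings, including the two the question says should fail. It assumes the input has already been chomped. The sub name and the test strings are mine. I used qr{...} delimiters so that / needs no backslash; inside a bracketed character class ., (, ) and * are already literal, so I only kept the backslash on -, and on $ and @ to guard against interpolation. Escaping the others as well is harmless.

```perl
use strict;
use warnings;

# Same character set as the regex above, compiled once.
# "$" must be escaped, otherwise "$&" is interpolated as a variable.
my $valid_request = qr{^(?:[a-z0-9\-._~!\$&'()*+,;=:/?\@]|%[0-9A-F]{2})*$}i;

# Hypothetical helper: returns 1 if the request portion only contains
# allowed characters or %XX escapes, 0 otherwise.
sub is_valid_request {
    my ($request) = @_;
    return $request =~ $valid_request ? 1 : 0;
}

# Made-up candidates; the last two are the invalid examples from the question.
for my $candidate ('/path/to/page?id=1', 'a%20b', 'te%st', "caf\x{e9}") {
    printf "%-20s %s\n", $candidate,
        is_valid_request($candidate) ? 'valid' : 'INVALID';
}
```

If every string still passes in the real script, the problem is more likely in the code around the regex than in the escaping.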
The (?: ... ) is a non-capturing group: it groups the two alternatives without storing a capture. Your first character class (the expression between the first []) has no multiplier of its own, but it does not need one, because the * after the closing parenthesis repeats the whole group. Likewise, the | is confined to the group, so it is an "or" between the first character class and the whole of %[0-9A-F]{2}, not just the % sign. Wrapping the second alternative in extra parentheses, as in |(%[0-9A-F]{2}))*$, would not change what matches.
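To check how the (?: ... ) group, the | and the trailing * bind, a small experiment helps. This sketch uses a cut-down pattern of the same shape (letters only in the first class, to keep it readable); the matches sub and the sample strings are made up for the illustration.

```perl
use strict;
use warnings;

# Same structure as the pattern in the question, with a shorter character class.
my $pattern = qr/^(?:[a-z]|%[0-9A-F]{2})*$/i;

# Hypothetical helper: 1 if the whole string matches, 0 otherwise.
sub matches {
    my ($s) = @_;
    return $s =~ $pattern ? 1 : 0;
}

# 'abc' and 'a%41b' can only match if * repeats the whole group.
# '%' and '%4' can only fail if | takes all of %[0-9A-F]{2} as one alternative.
for my $s ('abc', '%41%42', 'a%41b', '%', '%4') {
    printf "%-8s %s\n", $s, matches($s) ? 'match' : 'no match';
}
```

The first three strings match and the last two do not, which suggests the group already repeats as a unit and the alternation covers the complete percent escape.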
Upvotes: 2