Reputation: 69
I need to normalise URLs with Perl regular expressions before storing them in the database.
Here are some example URLs:
However, whenever I try the code below, instead of just removing the // after foo in foo//, it also removes the double slash in http://. I need to keep the // in http://, but I don’t need the // after foo. I also need to get rid of any /../ or /./ that can appear anywhere in the URL.
Basically, this:
"http://www.codeme.com:123/../foo//bar.html"
Should become this:
"http://www.codeme.com/foo/"
I am very new to Perl. I always ignored it and thought that I would never need it; however, life has proven me wrong. I would therefore really appreciate your help if you can put me on the right track.
sub main
{
    my $line;
    open(FH, "test.txt");
    until(($line = <FH>) =~ /9/) {
        $line =~ tr/A-Z/a-z/;
        $line =~ s|//|/| ;
        $line =~ s|\:\d\d\d|| ;
        $line =~ s|:80||;
        print $line;
    }
    close FH;
}
Upvotes: 0
Views: 688
Reputation: 19309
Use the URI module. It will make your life much better. It is not part of the Perl core, but it is available from CPAN and is often already installed as a dependency of LWP (libwww-perl).
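use strict;
use warnings;
use URI;

open(my $fh, '<', 'test.txt') or die "Cannot open test.txt: $!";
while (defined(my $line = <$fh>)) {
    last if $line =~ /9/;    # stop at the sentinel line, as in the question
    chomp($line);            # gets rid of the newline character
    my $url = URI->new($line);
    # path already starts with '/', so no extra separator is needed
    print $url->scheme, '://', $url->host, $url->path, "\n";
}
close $fh;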
It should clean up the URL pieces for you.
Also, you really don't need sub main. In Perl it's implicit.
Edit: As @spyroboy pointed out, this will not normalize the URL for you. You will still need to normalize the parts by some other means, but it isn't all that clear what you want normalization to do.
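One way to normalize the parts, sketched here under some assumptions: let URI split the URL, then apply the regular expressions to the path only, so the // in http:// is never touched. The helper name tidy_url is hypothetical, and dropping /./ and /../ outright (rather than resolving them) is an assumption based on what the question asks for.

```perl
use strict;
use warnings;
use URI;

# Hypothetical helper: tidy only the path component of a URL.
sub tidy_url {
    my ($text) = @_;

    # canonical lowercases the scheme and host and drops a default port
    my $uri  = URI->new($text)->canonical;
    my $path = $uri->path;

    $path =~ s{/\.\.?(?=/|\z)}{}g;   # drop whole "/." and "/.." segments
    $path =~ s{//+}{/}g;             # collapse runs of slashes into one

    $uri->path($path);
    return $uri->as_string;
}

print tidy_url('HTTP://www.CodeMe.com:80/./foo//bar.html'), "\n";
# prints http://www.codeme.com/foo/bar.html
```

Because the substitutions only ever see the path, the scheme separator is safe, which is the problem the original s|//|/| ran into. A non-default port such as :123 is kept by canonical, so removing it would still need an explicit step.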
Upvotes: 2
Reputation: 126722
The URI module, documented here, is the right way to go. It allows you to separate the URL into its component parts and adjust them separately. This Perl program seems to do what you need:
use strict;
use warnings;
use URI;

for (
    'http://www.codeme.com:80/foo/../index.php',
    'http://www.codeme.com:123/../foo//bar.html' ) {

    my $uri = URI->new($_);
    $uri->port(80);    # force the default port; canonical then omits it

    my @path = $uri->path_segments;
    @path = grep /[^.]/, @path;    # drop empty segments and those made only of dots

    # replace a default document name with an empty final segment (trailing slash)
    $path[-1] = '' if grep $path[-1] eq $_, qw/ default.htm index.php /;
    $uri->path_segments(@path);

    print $uri->canonical, "\n";
}
OUTPUT
http://www.codeme.com/foo/
http://www.codeme.com/foo/bar.html
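Note that the program above discards .. segments rather than resolving them, which is what the question asks for. If you would rather have .. remove the preceding segment, as a browser would, a minimal sketch might look like the following. The helper name resolve_dots is hypothetical, and this version does not preserve a trailing slash.

```perl
use strict;
use warnings;
use URI;

# Hypothetical helper: resolve "." and ".." instead of discarding them.
sub resolve_dots {
    my ($text) = @_;
    my $uri = URI->new($text);

    my @out;
    for my $seg ($uri->path_segments) {
        next if $seg eq '' || $seg eq '.';    # skip empty and "." segments
        if ($seg eq '..') {                   # ".." removes the previous segment
            pop @out;
            next;
        }
        push @out, $seg;
    }

    $uri->path_segments('', @out);    # the leading '' keeps the path absolute
    return $uri->canonical->as_string;
}

print resolve_dots('http://www.codeme.com:80/foo/../index.php'), "\n";
# prints http://www.codeme.com/index.php
```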
Upvotes: 0