user1254916

Reputation: 69

URL Regular Expression with Perl

I need to normalise URLs with Perl regular expressions before I store them in the database.

Here are some example URLs:

However, whenever I try the code below, it removes not only the extra // after foo in foo//, but also the double slash in http://. I need to keep the // in http:// and drop only the extra slash after foo. I also need to get rid of any /../ or /./ that can appear anywhere in the URL.

Basically, this:

"http://www.codeme.com:123/../foo//bar.html"

Should become this:

"http://www.codeme.com/foo/"

I am very new to Perl. I always ignored it and thought that I would never need it, but life has proven me wrong. I would really appreciate any help that puts me on the right track.

sub main
{
        my $line;  
        open(FH, "test.txt");

        until(($line = <FH>) =~ /9/) {

           $line =~ tr/A-Z/a-z/;

           $line =~  s|//|/| ;

           $line =~  s|\:\d\d\d|| ; 

           $line =~  s|:80||;   

            print $line;   
        }

        close FH;
}

Upvotes: 0

Views: 688

Answers (2)

Cfreak

Reputation: 19309

Use the URI module. It will make your life much better. It is not part of core Perl, but it is available on CPAN and is often already installed as a dependency of LWP.

http://metacpan.org/pod/URI

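use strict;
use warnings;
use URI;

open(my $fh, '<', 'test.txt') or die "Cannot open test.txt: $!";

while (my $line = <$fh>) {
     chomp($line);          # gets rid of the newline character
     last if $line =~ /9/;  # same stop condition as the loop in the question
     my $url = URI->new($line);
     # path already starts with '/', so no extra separator is printed
     print $url->scheme, '://', $url->host, $url->path, "\n";
}

close $fh;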

It should clean up the URL pieces for you.

Also, you really don't need sub main. In Perl it's implicit.

Edit: As @spyroboy pointed out, this will not normalize the URL for you. You will still need to normalize the parts through some means, but what you want to do with normalization isn't all that clear.
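As a rough illustration of normalizing "through some means", here is a minimal sketch that uses plain substitutions only. The helper name tidy_url is hypothetical, and the sketch assumes that any explicit port should be dropped (as in the question's example) and that the URL has no query string containing //. A negative lookbehind keeps the // that follows the scheme's colon while collapsing other runs of slashes.

```perl
use strict;
use warnings;

# Hypothetical helper: tidy one URL string with plain substitutions.
sub tidy_url {
    my ($url) = @_;

    # Drop an explicit port such as ":123" or ":80" right after the host.
    $url =~ s{^(\w+://[^/:]+):\d+}{$1};

    # Collapse runs of slashes, except the pair directly after "scheme:".
    $url =~ s{(?<!:)/{2,}}{/}g;

    return $url;
}

print tidy_url('http://www.codeme.com:123/foo//bar.html'), "\n";
# prints http://www.codeme.com/foo/bar.html
```

The original s|//|/| has no /g flag and no lookbehind, so it only ever replaces the first //, which is the one in http://.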

Upvotes: 2

Borodin

Reputation: 126722

The URI module, documented here, is the right way to go. It allows you to separate the URL into its component parts and adjust them separately. This Perl program seems to do what you need.

use strict;
use warnings;

use URI;

for (
    'http://www.codeme.com:80/foo/../index.php',
    'http://www.codeme.com:123/../foo//bar.html' ) {

  my $uri = URI->new($_);

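  # Force the default HTTP port; canonical() leaves it out of the output
  $uri->port(80);

  # Keep only segments containing a non-dot character, which drops
  # empty segments (from //) as well as . and ..
  my @path = $uri->path_segments;
  @path = grep /[^.]/, @path;

  # Replace a trailing default document name with an empty segment
  $path[-1] = '' if grep $path[-1] eq $_, qw/ default.htm index.php /;
  $uri->path_segments(@path);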

  print $uri->canonical, "\n";
}

OUTPUT

http://www.codeme.com/foo/
http://www.codeme.com/foo/bar.html
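Note that the grep in the program above discards . and .. segments rather than resolving them, which matches what the question asks for. If .. should instead cancel the segment before it, as RFC 3986 describes for dot-segment removal, a stack-based pass over the segments is one option. The following is only a sketch: resolve_dots is a hypothetical helper, it works on an absolute path string rather than a URI object, and unlike the RFC procedure it also drops empty segments.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of URI: resolve "." and ".." in an
# absolute path string. Empty segments (from "//") are dropped too.
sub resolve_dots {
    my ($path) = @_;

    # Remember whether the result should end in a slash.
    my $trailing = $path =~ m{/(?:\.\.?)?\z} ? 1 : 0;

    my @out;
    for my $seg (split m{/}, $path) {
        next if $seg eq '' || $seg eq '.';      # nothing to keep
        if ($seg eq '..') { pop @out; next; }   # cancel the previous segment
        push @out, $seg;
    }

    my $result = '/' . join('/', @out);
    $result .= '/' if $trailing && @out;
    return $result;
}

print resolve_dots('/foo/../index.php'), "\n";    # /index.php
print resolve_dots('/../foo//bar.html'), "\n";    # /foo/bar.html
```

With this variant the first example path resolves to /index.php rather than keeping foo, so which behaviour is right depends on what "normalised" should mean for your data.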

Upvotes: 0
