Reputation: 104065

How can I format URLs nicely in Perl?

I have a bunch of URLs that I have to turn into links:

for my $url (@url_list) {
    say "<a href='$url'>$url</a>";
}

Is there a module for making the visible URL nicer? A bit like this:

http://www.foo.com/ → www.foo.com
http://www.foo.com/long_path → www.foo.com/lo…

I know a simple regex will probably do here, but I’m spoiled by CPAN. :)

Upvotes: 2

Answers (5)

brian d foy

Reputation: 132832

The trick is figuring out how you want to pretty-print each sort of URL, so in that case you need to tell your script what to do in each case:

use URI;

while( <DATA> ) {
    chomp;
    my $uri = URI->new( $_ );

    my $s = $uri->scheme;
    my $rest = do {
        if( $s =~ /(?:https?|ftp)/ ) {
            $uri->host . $uri->path_query
            }
        elsif( $s eq 'mailto' ) {
            $uri->path
            }
        elsif( ! $s ) {
            $uri
            }
        };

    print "$uri -> $rest\n";
    }

__END__
http://www.example.com/foo/bar.html
www.example.com/foo/bar.html
ftp://www.example.com
mailto:[email protected]
https://www.example.com/foo?a=b;c=d
http://joe:[email protected]/login

This produces:

http://www.example.com/foo/bar.html -> www.example.com/foo/bar.html
www.example.com/foo/bar.html -> www.example.com/foo/bar.html
ftp://www.example.com -> www.example.com
mailto:[email protected] -> [email protected]
https://www.example.com/foo?a=b;c=d -> www.example.com/foo?a=b;c=d
http://joe:[email protected]/login -> www.example.com/login

If you want something different for a particular URL, you just need to make a branch for it and put together the parts that you want. Notice the URI also handles schemeless URIs.

If you don't want long URI strings for your pretty printing, you might throw in something like this to cut off the string after so many characters:

substr( $rest, 20 ) = '...' if length $rest > 20;

Here's a solution with given, which is slightly cleaner, but also slightly uglier. This is the Perl 5.010 version:

use 5.010;
use URI;

while( <DATA> ) {
    chomp;
    my $uri = URI->new( $_ );

    my $r;
    given( $uri->scheme ) {
        when( /(?:https?|ftp)/  ) { $r = $uri->host . $uri->path_query }
        when( 'mailto' )          { $r = $uri->path }       
        default                   { $r = $uri }
        }


    print "$uri -> $r\n";
    }

It's uglier because I have to repeat that assignment to $r. Perl 5.14 is going to fix that though be letting given have a return value. Since that stable version isn't available yet, you have to use the experimental 5.13 track:

use 5.013004;
use URI;

while( <DATA> ) {
    chomp;
    my $uri = URI->new( $_ );

    my $r = do {
        given( $uri->scheme ) {
            when( /(?:https?|ftp)/  ) { $uri->host . $uri->path_query }
            when( 'mailto' )          { $uri->path }        
            default                   { $uri }
            }
        };

    print "$uri -> $r\n";
    }

Upvotes: 4

Greg Bacon

Reputation: 139531

Appendix B of RFC 2396 specifies a regular expression that parses a URI reference. Adapt that a bit to get what you want:

#! /usr/bin/perl

use warnings;
use strict;

use 5.10.0;  # for defined-or (//)

my $uri = qr{
  ^
  (?:([^:/?\#]+):)?  # scheme = $1
  (?://([^/?\#]*))?  # authority = $2
  ([^?\#]*)          # path = $3
  (\?[^\#]*)?        # query = $4
  (\#.*)?            # fragment = $5
}x;

The code above uses the /x modifier

It tells the regular expression parser to ignore most whitespace that is neither backslashed nor within a character class. You can use this to break up your regular expression into (slightly) more readable parts. The # character is also treated as a metacharacter introducing a comment, just as in ordinary Perl code.

but we want to match literal # characters if they're present, which meant I needed to escape them with backslashes. Out of habit, I started with qr/ but had to change the delimiter because of the slashes in the pattern.

A few test cases:

my @cases = qw(
  ftp://www.foo.com.invalid/
  http://www.foo.com.invalid/
  http://www.foo.com.invalid/long_path
  http://www.foo.com.invalid/?query
  http://www.foo.com.invalid?query
  http://www.foo.com.invalid/#fragment
  http://www.foo.com.invalid#fragment
);

A bit of logic

for (@cases) {
  my $nice;
  if (my($scheme,$auth,$path,@rest) = /$uri/) {
    if ($scheme eq "http" && defined $auth) {
      if (grep defined, @rest) {
        $nice = join "" => map $_ // "" => $auth, $path, @rest;
      }
      else {
        $nice = $auth
              . ($path eq "/" ? "" : $path);
      }
    }
    else {
      $nice = $_;
    }
  }

  print "$_ → $nice\n";
}

and the output:

ftp://www.foo.com.invalid/ → ftp://www.foo.com.invalid/
http://www.foo.com.invalid/ → www.foo.com.invalid
http://www.foo.com.invalid/long_path → www.foo.com.invalid/long_path
http://www.foo.com.invalid/?query → www.foo.com.invalid/?query
http://www.foo.com.invalid?query → www.foo.com.invalid?query
http://www.foo.com.invalid/#fragment → www.foo.com.invalid/#fragment
http://www.foo.com.invalid#fragment → www.foo.com.invalid#fragment

Upvotes: 5

Paul Stevens

Reputation: 21

Part of the joy of Perl is not relying on modules :) I managed the following solution:


#!/usr/bin/perl -w

use strict;

my @url_list = ("<a href=http://www.test.com>www.test.com</a>",
                "<a href=http://www.example.com>www.example.com</a>",
                "<a href=http://www.this.com>www.this.com</a>");

my ($protocol, $domain_name);

foreach my $url (@url_list) {
    $url =~ m|(\w+)://([^/:]+)(:\d+)?/(.*)|;
    $protocol = $1;
    $domain_name = $2;
    my ($url_part, $name_part) = split(/>/, $domain_name);
    $name_part =~ s/\<//g;
    print $protocol, "://" ,$url_part, " -> ", $name_part  , "\n";
}

It's not awesome, and I ended up with a stray < in the domain name that took a substitute to remove. To answer your original question, you can combine LWP::Simple and HTML::LinkExtor to download and parse HTML docs from the web. Powerful combo.

** Disclaimer: Since Ruby and Python, my Perl sucks. Apologies to the purists for brutalizing your language.

Upvotes: -1

Toto

Reputation: 91428

I'm not quite sure what you exactly want. I guess you want to strip out http:// and have a shortened url to be displayed. If it's the case you can do something like :

#!/usr/bin/perl
use strict;
use warnings;
use 5.10.1;


my @url_list = ('http://www.foo.com/','http://www.foo.com/long_path');

for my $url (@url_list) {
    (my $short = $url) =~ s!\w+://!!;
    $short =~ s!/$!!;
    $short =~ s!^(.{15}).*$!$1...!;
    say "<a href='$url'>$short</a>";
}

Output:

<a href='http://www.foo.com/'>www.foo.com</a>
<a href='http://www.foo.com/long_path'>www.foo.com/lon...</a>

Upvotes: 0

Powertieke

Reputation: 2408

Try the URI module from cpan.

Upvotes: 1

How can I format URLs nicely in Perl?

Answers (5)

Related Questions