Reputation: 104065
I have a bunch of URLs that I have to turn into links:
for my $url (@url_list) {
say "<a href='$url'>$url</a>";
}
Is there a module for making the visible URL nicer? A bit like this:
http://www.foo.com/ → www.foo.com http://www.foo.com/long_path → www.foo.com/lo…
I know a simple regex will probably do here, but I’m spoiled by CPAN. :)
Upvotes: 2
Views: 510
Reputation: 132832
The trick is figuring out how you want to pretty-print each sort of URL, so in that case you need to tell your script what to do in each case:
use URI;
while( <DATA> ) {
chomp;
my $uri = URI->new( $_ );
my $s = $uri->scheme;
my $rest = do {
if( $s =~ /(?:https?|ftp)/ ) {
$uri->host . $uri->path_query
}
elsif( $s eq 'mailto' ) {
$uri->path
}
elsif( ! $s ) {
$uri
}
};
print "$uri -> $rest\n";
}
__END__
http://www.example.com/foo/bar.html
www.example.com/foo/bar.html
ftp://www.example.com
mailto:[email protected]
https://www.example.com/foo?a=b;c=d
http://joe:[email protected]/login
This produces:
http://www.example.com/foo/bar.html -> www.example.com/foo/bar.html
www.example.com/foo/bar.html -> www.example.com/foo/bar.html
ftp://www.example.com -> www.example.com
mailto:[email protected] -> [email protected]
https://www.example.com/foo?a=b;c=d -> www.example.com/foo?a=b;c=d
http://joe:[email protected]/login -> www.example.com/login
If you want something different for a particular URL, you just need to make a branch for it and put together the parts that you want. Notice the URI
also handles schemeless URIs.
If you don't want long URI strings for your pretty printing, you might throw in something like this to cut off the string after so many characters:
substr( $rest, 20 ) = '...' if length $rest > 20;
Here's a solution with given
, which is slightly cleaner, but also slightly uglier. This is the Perl 5.010 version:
use 5.010;
use URI;
while( <DATA> ) {
chomp;
my $uri = URI->new( $_ );
my $r;
given( $uri->scheme ) {
when( /(?:https?|ftp)/ ) { $r = $uri->host . $uri->path_query }
when( 'mailto' ) { $r = $uri->path }
default { $r = $uri }
}
print "$uri -> $r\n";
}
It's uglier because I have to repeat that assignment to $r
. Perl 5.14 is going to fix that though be letting given
have a return value. Since that stable version isn't available yet, you have to use the experimental 5.13 track:
use 5.013004;
use URI;
while( <DATA> ) {
chomp;
my $uri = URI->new( $_ );
my $r = do {
given( $uri->scheme ) {
when( /(?:https?|ftp)/ ) { $uri->host . $uri->path_query }
when( 'mailto' ) { $uri->path }
default { $uri }
}
};
print "$uri -> $r\n";
}
Upvotes: 4
Reputation: 139531
Appendix B of RFC 2396 specifies a regular expression that parses a URI reference. Adapt that a bit to get what you want:
#! /usr/bin/perl
use warnings;
use strict;
use 5.10.0; # for defined-or (//)
my $uri = qr{
^
(?:([^:/?\#]+):)? # scheme = $1
(?://([^/?\#]*))? # authority = $2
([^?\#]*) # path = $3
(\?[^\#]*)? # query = $4
(\#.*)? # fragment = $5
}x;
The code above uses the /x
modifier
It tells the regular expression parser to ignore most whitespace that is neither backslashed nor within a character class. You can use this to break up your regular expression into (slightly) more readable parts. The
#
character is also treated as a metacharacter introducing a comment, just as in ordinary Perl code.
but we want to match literal #
characters if they're present, which meant I needed to escape them with backslashes. Out of habit, I started with qr/
but had to change the delimiter because of the slashes in the pattern.
A few test cases:
my @cases = qw(
ftp://www.foo.com.invalid/
http://www.foo.com.invalid/
http://www.foo.com.invalid/long_path
http://www.foo.com.invalid/?query
http://www.foo.com.invalid?query
http://www.foo.com.invalid/#fragment
http://www.foo.com.invalid#fragment
);
A bit of logic
for (@cases) {
my $nice;
if (my($scheme,$auth,$path,@rest) = /$uri/) {
if ($scheme eq "http" && defined $auth) {
if (grep defined, @rest) {
$nice = join "" => map $_ // "" => $auth, $path, @rest;
}
else {
$nice = $auth
. ($path eq "/" ? "" : $path);
}
}
else {
$nice = $_;
}
}
print "$_ → $nice\n";
}
and the output:
ftp://www.foo.com.invalid/ → ftp://www.foo.com.invalid/ http://www.foo.com.invalid/ → www.foo.com.invalid http://www.foo.com.invalid/long_path → www.foo.com.invalid/long_path http://www.foo.com.invalid/?query → www.foo.com.invalid/?query http://www.foo.com.invalid?query → www.foo.com.invalid?query http://www.foo.com.invalid/#fragment → www.foo.com.invalid/#fragment http://www.foo.com.invalid#fragment → www.foo.com.invalid#fragment
Upvotes: 5
Reputation: 21
Part of the joy of Perl is not relying on modules :) I managed the following solution:
#!/usr/bin/perl -w
use strict;
my @url_list = ("<a href=http://www.test.com>www.test.com</a>",
"<a href=http://www.example.com>www.example.com</a>",
"<a href=http://www.this.com>www.this.com</a>");
my ($protocol, $domain_name);
foreach my $url (@url_list) {
$url =~ m|(\w+)://([^/:]+)(:\d+)?/(.*)|;
$protocol = $1;
$domain_name = $2;
my ($url_part, $name_part) = split(/>/, $domain_name);
$name_part =~ s/\<//g;
print $protocol, "://" ,$url_part, " -> ", $name_part , "\n";
}
It's not awesome, and I ended up with a stray <
in the domain name that took a substitute to remove. To answer your original question, you can combine LWP::Simple
and HTML::LinkExtor
to download and parse HTML docs from the web. Powerful combo.
** Disclaimer: Since Ruby and Python, my Perl sucks. Apologies to the purists for brutalizing your language.
Upvotes: -1
Reputation: 91428
I'm not quite sure what you exactly want. I guess you want to strip out http://
and have a shortened url to be displayed. If it's the case you can do something like :
#!/usr/bin/perl
use strict;
use warnings;
use 5.10.1;
my @url_list = ('http://www.foo.com/','http://www.foo.com/long_path');
for my $url (@url_list) {
(my $short = $url) =~ s!\w+://!!;
$short =~ s!/$!!;
$short =~ s!^(.{15}).*$!$1...!;
say "<a href='$url'>$short</a>";
}
Output:
<a href='http://www.foo.com/'>www.foo.com</a>
<a href='http://www.foo.com/long_path'>www.foo.com/lon...</a>
Upvotes: 0