Jestfer

Reputation: 9

Getting absolute URLs with the URI module when the object is created outside the loop

I have a question I've been trying to answer myself using the CPAN module documentation, but I'm fairly new to Perl and I'm confused by some of the terminology and sections in the different modules.

In the code below I'm trying to create a URI object and use it to get the absolute URL for each relative link extracted from a web page.

#!/usr/bin/perl

use strict;
use warnings;
use LWP::UserAgent;         
use Digest::MD5 qw(md5_hex);
use URI;

my $url = $ARGV[0];

if ($url !~ m{^https?://[^\W]+-?\.com/?}i) {
    exit(0);                         
}      

my $ua = LWP::UserAgent->new;
$ua->timeout( 10 );

my $response = $ua->get( $url );  

my $content = $response->decoded_content();

my $links = URI->new($content);
my $abs = $links->abs('http:', $content);
my $abs_links = $links->abs($abs);

while ($content =~ m{<a[^>]\s*href\s*=\s*"?([^"\s>]+)}gis) {
    $abs_links = $1;
    print "$abs_links\n";
    print "Digest for the above URL is " . md5_hex($abs_links) . "\n";             
}

The problem is that when I put that part (the three lines just before the while loop) outside the loop, it does not work, whereas the same code inside the while loop works fine. As written, the script just prints the links exactly as they appear in the page, so instead of printing "http://..." it prints "//...".

The script that works fine for me is the following:

#!/usr/bin/perl
use strict;
use warnings;

use LWP::UserAgent;            
use Digest::MD5 qw(md5_hex);
use URI::URL;

my $url = $ARGV[0];                            ## Url passed in command
if ($url !~ m{^https?://[\w]+-?[\w]+\.com/?}i) {
    exit(0);                                   ## Program stops if not valid URL
}         

my $ua = LWP::UserAgent->new;
$ua->timeout( 10 );

my $response = $ua->get( $url );               ## Get response, not content

my $content = $response->decoded_content();    ## Now let's get the content

while ($content =~ m{<a[^>]\s*href\s*=\s*"?([^"\s>]+)}gis) {    ## All links
    my $links = $1;
    my $abs = new URI::URL "$links";
    my $abs_url = $abs->abs('http:', $links);
    print "$abs_url\n";
    print "Digest for the above URL is " . md5_hex($abs_url) . "\n";              
} 

Any ideas? Much appreciated.

Upvotes: 0

Views: 96

Answers (2)

Dave Cross

Reputation: 69274

I think your biggest mistake is trying to parse links out of HTML using a regular expression. You would be far better advised to use a CPAN module for this. I'd recommend WWW::Mechanize, which would make your code look something like this:

#!/usr/bin/perl

use strict;
use warnings;
use feature 'say';

use WWW::Mechanize;         
use Digest::MD5 qw(md5_hex);
use URI;

my $url = $ARGV[0];

if ($url !~ m{^https?://[^\W]+-?\.com/?}i) {
    exit(0);                         
}      

my $ua = WWW::Mechanize->new;
$ua->timeout( 10 );

$ua->get( $url );  

foreach my $link ($ua->links) {
  my $abs = $link->url_abs->as_string;   # absolute form of the link, as a plain string
  say $abs;
  say "Digest for the above URL is " . md5_hex($abs) . "\n";
}

That looks a lot simpler to me.

Upvotes: 1

melpomene

Reputation: 85767

I don't understand your code. There are a few weird bits:

  • [^\W] is the same as \w
  • The regex allows an optional - before .com and an optional / after it, but nothing else between the first run of word characters and .com, so http://bitwise.complement.biz matches but http://cool-beans.com doesn't.
  • URI->new($content) makes no sense: $content is random HTML, not a URI.
  • $links->abs('http:', $content) makes no sense: $content is simply ignored, and $links->abs('http:') tries to make $links an absolute URL relative to 'http:', but 'http:' is not a valid URL.
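
As a rough check of the points above, here is a minimal sketch that runs the question's pattern against the two example hosts and resolves a few links against a real base URL. The base URL and the sample links are made up for illustration, passes_check and absolutize are hypothetical helper names, and URI->new_abs is used in place of the question's ->abs('http:') call.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use URI;

# The validation pattern from the question, unchanged.
my $re = qr{^https?://[^\W]+-?\.com/?}i;

# Returns 1 if the URL passes the question's check, 0 otherwise.
sub passes_check {
    my ($url) = @_;
    return $url =~ $re ? 1 : 0;
}

# Resolve a (possibly relative) link against a full base URL.
sub absolutize {
    my ($link, $base) = @_;
    return URI->new_abs($link, $base)->as_string;
}

# Hypothetical page URL, standing in for $response->base.
my $base = 'http://example.com/news/index.html';

print passes_check('http://bitwise.complement.biz'), "\n";   # prints 1 (accepted)
print passes_check('http://cool-beans.com'), "\n";           # prints 0 (rejected)

# Relative, root-relative, scheme-relative and already-absolute links.
for my $link ('story.html', '/about', '//cdn.example.com/app.js', 'https://example.org/') {
    print "$link -> ", absolutize($link, $base), "\n";
}
```

The scheme-relative link ("//cdn.example.com/app.js") is the kind the question sees printed as "//...": it only gains its "http:" prefix once it is resolved against a base that actually has a scheme and host.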

Here's what I think you're trying to do:

#!/usr/bin/perl
use strict;
use warnings;

use LWP::UserAgent;
use HTML::LinkExtor;
use Digest::MD5 qw(md5_hex);

@ARGV == 1 or die "Usage: $0 URL\n";
my $url = $ARGV[0];

my $ua = LWP::UserAgent->new(timeout => 10);

my $response = $ua->get($url);
$response->is_success or die "$0: " . $response->request->uri . ": " . $response->status_line . "\n";

my $content = $response->decoded_content;
my $base = $response->base;

my @links;
my $p = HTML::LinkExtor->new(
    sub {
        my ($tag, %attrs) = @_;
        if ($tag eq 'a' && $attrs{href}) {
            push @links, "$attrs{href}";  # stringify
        }
    },
    $base,
);

$p->parse($content);
$p->eof;

for my $link (@links) {
    print "$link\n";
    print "Digest for the above URL is " . md5_hex($link) . "\n";
}

Notes on this version:

  • I don't try to validate the URL passed in $ARGV[0]. Leave it to LWP::UserAgent. (If you don't like this, just add the check back in.)
  • I make sure $ua->get($url) was successful before proceeding.
  • I get the base URL for absolutifying relative links from $response->base.
  • I use HTML::LinkExtor for parsing the content, extracting links, and making them absolute.
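
The link-extraction step can be tried without any network access. The following is a minimal sketch, assuming a made-up HTML snippet and base URL; extract_links is a hypothetical helper name, not part of HTML::LinkExtor.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::LinkExtor;

# Collect the <a href> targets from an HTML string, made absolute
# against $base by HTML::LinkExtor itself.
sub extract_links {
    my ($html, $base) = @_;
    my @links;
    my $p = HTML::LinkExtor->new(
        sub {
            my ($tag, %attrs) = @_;
            # Only keep anchors; other link-carrying tags such as <img> are skipped.
            push @links, "$attrs{href}" if $tag eq 'a' && $attrs{href};
        },
        $base,
    );
    $p->parse($html);
    $p->eof;
    return @links;
}

# Made-up page content for illustration.
my $html = <<'HTML';
<p><a href="page2.html">next</a>
<a href="//static.example.com/logo.png">logo</a>
<img src="ignored.png">
<a href="/contact">contact</a></p>
HTML

# Hypothetical base URL, standing in for $response->base.
print "$_\n" for extract_links($html, 'http://example.com/docs/index.html');
```

With this input the three anchors come back as full http:// URLs and the <img> tag is ignored, which is the behaviour the question was trying to get from the regex plus URI::URL combination.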

Upvotes: 1
