nixv

Reputation: 1

Download files with perl lwp linkextractor

I am trying to download a file from a web page.

First I get the links with HTML::LinkExtractor, and then I want to download them with LWP. I'm a newbie at programming in Perl.

I made the following code ...

#!/usr/bin/perl

use strict;
use warnings;

use HTML::TableExtract;
use HTML::LinkExtractor;
use LWP::Simple qw(get);
use Archive::Zip;

my $html = get $ARGV[0];

my $te = HTML::TableExtract->new(
    keep_html => 1,
    headers => [qw( column1 column2 )],
);
$te->parse($html);

# I get only the first row
my ($row) = $te->rows;

my $LXM = new HTML::LinkExtractor(undef,undef,1);
$LXM->parse(\$$row[0]);
my ($t) = $LXM->links;

my $LXS = new HTML::LinkExtractor(undef,undef,1);
$LXS->parse(\$$row[1]);
my ($s) = $LXS->links;

#-------
for (my $i=0; $i < scalar(@$s); $i++) {
  print "$$s[$i]{_TEXT} $$s[$i]{href} $$t[$i]{href} \n";
  my $file = '/tmp/$$s[$i]{_TEXT}';
  my $url = $$s[$i]{href};
  my $content = getstore($url, $file);
  die "Couldn't get it!" unless defined $content;
}

And I get the following error

Undefined subroutine &main::getstore called at ./geturlfromtable.pl line 35.

Thanks in advance!

Upvotes: 0

Views: 159

Answers (1)

Dave Cross

Reputation: 69244

LWP::Simple can be loaded in two different ways.

use LWP::Simple;

This loads the module and makes all of its functions available to your program.

use LWP::Simple qw(list of function names);

This loads the module and only makes available the specific set of functions you have requested.

You have this code:

use LWP::Simple qw(get);

This makes the get() function available, but not the getstore() function.

To fix this, either add getstore() to your list of functions:

use LWP::Simple qw(get getstore);

Or (probably simpler) remove the list of functions:

use LWP::Simple;

Update: I hope you don't mind if I add a couple of style points.

Firstly, you're using a really old module - HTML::LinkExtractor. It hasn't been updated for almost fifteen years. I'd recommend looking at HTML::LinkExtor instead.
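As a rough sketch of what that could look like (the HTML snippet and the example.com base URL are made up for illustration), HTML::LinkExtor collects each link as a tag name followed by attribute/value pairs. Note that it does not capture the link text, so it is not a drop-in replacement for the _TEXT field used in the question.

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# Made-up HTML and base URL, standing in for one table cell of the real page.
my $html = '<p><a href="/files/a.zip">a.zip</a> <a href="/files/b.zip">b.zip</a></p>';

# No callback, so links accumulate; the base URL makes relative hrefs absolute.
my $parser = HTML::LinkExtor->new(undef, 'http://example.com/');
$parser->parse($html);
$parser->eof;

my @urls;
for my $link ($parser->links) {
    my ($tag, %attr) = @$link;    # e.g. ('a', href => URI object)
    push @urls, "$attr{href}" if $tag eq 'a' && defined $attr{href};
}

print "$_\n" for @urls;
```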

Secondly, your code uses a lot of references, but you're using them in a really over-complicated way. For example, where you have \$$row[0], it's clearer to write \$row->[0]. Similarly, $$s[$i]{href} will be easier for most people to understand if written as $s->[$i]{href}.
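To show that the two spellings reach the same element, here is a small self-contained sketch. The data is invented, shaped like the list of links in the question.

```perl
use strict;
use warnings;

# Invented data: an array reference holding one hash reference per link.
my $s = [ { href => 'http://example.com/a.zip', _TEXT => 'a.zip' } ];

my $old = $$s[0]{href};     # dereference with a doubled sigil
my $new = $s->[0]{href};    # same element, arrow syntax

print "Both forms give: $new\n" if $old eq $new;
```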

Next, you use the C-style for loop and iterate over the array's indexes. It's usually simpler to use foreach to iterate from zero to the last index in the array.

foreach my $i (0 .. $#$s) {
  print "$s->[$i]{_TEXT} $s->[$i]{href} $t->[$i]{href} \n";
  my $file = "/tmp/$s->[$i]{_TEXT}";
  my $url = $s->[$i]{href};
  my $content = getstore($url, $file);
  die "Couldn't get it!" unless defined $content;
}

And finally, you seem slightly confused about what getstore() returns. It returns the HTTP response code. So it will never be undefined. If there's a problem retrieving the content, you'll get 500 or 403 or something like that.
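A minimal sketch of checking that return value, assuming the default use LWP::Simple; import, which also exports is_success() from HTTP::Status. The helper name fetch_to_file and the URL in the usage comment are hypothetical.

```perl
use strict;
use warnings;
use LWP::Simple;    # default import includes getstore() and is_success()

# Hypothetical helper: save $url to $file and die on a non-2xx status.
sub fetch_to_file {
    my ($url, $file) = @_;
    my $status = getstore($url, $file);    # numeric HTTP status code
    die "Couldn't get $url: HTTP status $status\n"
        unless is_success($status);
    return $status;
}

# Usage (placeholder URL):
# fetch_to_file('http://example.com/file.zip', '/tmp/file.zip');
```

Including the status code in the error message makes it easier to tell a missing file (404) from a permissions problem (403) or a connection failure (500).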

Upvotes: 2
