Reputation: 1
I am using Strawberry Perl on Windows XP to download multiple HTML pages, and I want each one in its own variable.
Right now I am doing this, but as far as I can tell it fetches one page at a time:
my $page = `curl -s http://mysite.com/page -m 2`;
my $page2 = `curl -s http://myothersite.com/page -m 2`;
I looked into Parallel::ForkManager, but couldn't get it to work.
I also tried putting the Windows command start before curl, but that doesn't get the page.
Is there a simpler way to do this?
Upvotes: 0
Views: 259
Reputation: 126762
The Parallel::ForkManager module should work for you, but because it uses fork instead of threads, the variables in the parent and in each child process are separate, so the processes must communicate some other way.
This program uses the -o option of curl to save the pages in files. The page for, say, http://mysite.com/page is saved in the file http\mysite.com\page and can be retrieved from there by the parent process.
use strict;
use warnings;
use Parallel::ForkManager;
use URI;
use File::Spec;
use File::Path 'make_path';
my $pm = Parallel::ForkManager->new(10);
foreach my $site (qw( http://mysite.com/page http://myothersite.com/page )) {
    my $pid = $pm->start;
    next if $pid;
    fetch($site);
    $pm->finish;
}
$pm->wait_all_children;
sub fetch {
    my ($url) = @_;
    my $uri = URI->new($url);
    my $filename = File::Spec->catfile($uri->scheme, $uri->host, $uri->path);
    my ($vol, $dir, $file) = File::Spec->splitpath($filename);
    make_path $dir;
    # Fetch the URL passed in, not a hard-coded one
    print `curl $url -m 2 -o $filename`;
}
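Once wait_all_children has returned, the parent can read the saved files back into variables. The following is only a minimal sketch of that step, using core Perl alone. The helper names slurp_page and collect_pages are hypothetical, and it assumes you pass in a hash mapping each URL to the filename that fetch wrote for it.

```perl
use strict;
use warnings;

# Hypothetical helper: read a whole file into one scalar.
sub slurp_page {
    my ($filename) = @_;
    open my $fh, '<:raw', $filename or die "Cannot open $filename: $!";
    local $/;                  # no record separator: read everything at once
    my $content = <$fh>;
    close $fh;
    return $content;
}

# Hypothetical helper: takes url => filename pairs and returns
# url => page content for every file that actually exists.
sub collect_pages {
    my (%file_for) = @_;
    my %pages;
    for my $url (keys %file_for) {
        my $filename = $file_for{$url};
        next unless -e $filename;    # the fetch failed or timed out
        $pages{$url} = slurp_page($filename);
    }
    return %pages;
}
```

The filenames would be built the same way fetch builds them, and collect_pages would be called after $pm->wait_all_children. A URL whose download failed is simply missing from the returned hash.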
Update
Here is a version that uses threads with threads::shared to return each page into a hash shared between all the threads. The hash must be marked as shared, and locked before it is modified to prevent concurrent access.
use strict;
use warnings;
use threads;
use threads::shared;
my %pages;
my @threads;
share %pages;
foreach my $site (qw( http://mysite.com/page http://myothersite.com/page )) {
    my $thread = threads->new('fetch', $site);
    push @threads, $thread;
}
$_->join for @threads;
for (scalar keys %pages) {
    printf "%d %s fetched\n", $_, $_ == 1 ? 'page' : 'pages';
}
sub fetch {
    my ($url) = @_;
    my $page = `curl -s $url -m 2`;
    lock %pages;
    $pages{$url} = $page;
}
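As a possible variation on the threads version, each thread can hand its page back through the return value of join, which avoids the shared hash and the lock. This is a sketch, not a drop-in replacement. The name fetch_all is hypothetical, and the fetching code is passed in as a code reference so that the example does not depend on curl or the network.

```perl
use strict;
use warnings;
use threads;

# Hypothetical helper: run $fetcher->($url) in one thread per URL
# and return a hash of url => whatever the fetcher returned.
sub fetch_all {
    my ($fetcher, @urls) = @_;
    my %thread_for;
    for my $url (@urls) {
        # Scalar assignment, so the thread runs in scalar context
        # and join returns a single value.
        my $thread = threads->create($fetcher, $url);
        $thread_for{$url} = $thread;
    }
    my %pages;
    for my $url (keys %thread_for) {
        $pages{$url} = $thread_for{$url}->join;
    }
    return %pages;
}
```

To use it with curl, pass a fetcher such as sub { my ($url) = @_; return scalar qx{curl -s $url -m 2} } followed by the list of URLs.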
Upvotes: 3