mata safute

Reputation: 1

Perl: starting multiple processes at the same time

I am using Strawberry Perl on Windows XP to download multiple HTML pages, and I want each one in its own variable.

Right now I am doing this, but as far as I can tell it fetches only one page at a time:

my $page = `curl -s http://mysite.com/page -m 2`;
my $page2 = `curl -s http://myothersite.com/page -m 2`;

I looked into Parallel::ForkManager, but couldn't get it to work. I also tried using the Windows start command before curl, but that doesn't retrieve the page.

Is there a simpler way to do this?

Upvotes: 0

Views: 259

Answers (1)

Borodin

Reputation: 126762

The Parallel::ForkManager module should work for you, but because it uses fork instead of threads, the variables in the parent and in each child process are separate, so they must communicate in a different way.
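
A minimal sketch (my addition, not part of the original answer) of why fork needs this workaround: an assignment made in the child changes only the child's copy of the variable, so the parent never sees the fetched page.

use strict;
use warnings;

my $page = 'empty';

my $pid = fork;
die "fork failed: $!" unless defined $pid;

if ($pid == 0) {
    # Child: this assignment changes only the child's copy of $page
    $page = 'fetched in child';
    exit 0;
}

waitpid $pid, 0;
print $page, "\n";    # still prints "empty" in the parent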

This program uses the -o option of curl to save each page to a file. The page at, say, http://mysite.com/page is saved in the file http\mysite.com\page, from where it can be read back by the parent process.

use strict;
use warnings;

use Parallel::ForkManager;
use URI;
use File::Spec;
use File::Path 'make_path';

my $pm = Parallel::ForkManager->new(10);    # run at most ten children at once

foreach my $site (qw( http://mysite.com/page http://myothersite.com/page )) {
  my $pid = $pm->start;    # returns the child's PID in the parent, 0 in the child
  next if $pid;            # parent moves straight on to the next site
  fetch($site);            # only the child reaches here...
  $pm->finish;             # ...and exits when the fetch is done
}

$pm->wait_all_children;

sub fetch {
  my ($url) = @_;

  # Build a file path from the URL, e.g. http://mysite.com/page
  # becomes http\mysite.com\page on Windows
  my $uri = URI->new($url);
  my $filename = File::Spec->catfile($uri->scheme, $uri->host, $uri->path);
  my ($vol, $dir, $file) = File::Spec->splitpath($filename);

  # Create the directory tree, then let curl write the page to the file
  make_path $dir;
  print `curl -s $url -m 2 -o $filename`;
}
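
The parent can then read each saved file back into a variable. A minimal sketch of that step (my addition, assuming @sites holds the same list of URLs and this runs after wait_all_children):

my %pages;
foreach my $site (@sites) {
    # Rebuild the same file path that fetch() used
    my $uri = URI->new($site);
    my $filename = File::Spec->catfile($uri->scheme, $uri->host, $uri->path);

    open my $fh, '<', $filename or next;    # skip pages that failed to download
    local $/;                               # slurp mode: read the whole file at once
    $pages{$site} = <$fh>;
}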

Update

Here is a version that uses threads with threads::shared to return each page in a hash shared between all the threads. The hash must be marked as shared, and it must be locked before it is modified, to prevent concurrent access.

use strict;
use warnings;

use threads;
use threads::shared;

my %pages;
my @threads;

share %pages;    # mark the hash as shared across all threads

foreach my $site (qw( http://mysite.com/page http://myothersite.com/page )) {
  my $thread = threads->new('fetch', $site);
  push @threads, $thread;
}

$_->join for @threads;

for (scalar keys %pages) {
  printf "%d %s fetched\n", $_, $_ == 1 ? 'page' : 'pages';
}

sub fetch {
  my ($url) = @_;
  my $page = `curl -s $url -m 2`;
  lock %pages;             # serialize access; the lock is released when the sub returns
  $pages{$url} = $page;
}
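
Once the threads have been joined, every page body is available in %pages. For example (my addition), to show what was fetched:

printf "%s: %d bytes\n", $_, length $pages{$_} for sort keys %pages;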

Upvotes: 3
