Rajeev

Reputation: 1371

How to fork a limited number of background processes from a sub while processing thousands of files

I simply want to open a compressed/uncompressed file in the background and produce a new file based on the processing done on the compressed file.

I could do it with Parallel::ForkManager, but I believe that is not available.

I found this, but am not sure how to use it:

sub backgroundProcess {
    my $file = shift;
    my $pid  = fork;
    return if $pid;    # in the parent process
    &process_file($file);
    exit;              # end child process
}

sub process_file {
    my $file    = shift;
    my $outFile = $file . ".out";
    # ...here...
    open( readHandle,  "<", $file )    or die "failed read $!";
    open( writeHandle, ">", $outFile ) or die "failed write $!";
    # some processing here.....
    # and then closing handles...
}

The loop:

foreach my $file (@filesToProcess) {
    &backgroundProcess($file);
}

My questions:

  1. Does the child process created in backgroundProcess keep running after the parent returns (at the line "return if $pid")?
  2. In process_file, how do I make sure each file gets its own file handle, or will fork take care of that?
  3. In the loop over @filesToProcess, I want only a certain number of processes running at a time. How do I check whether the number of background processes has reached $LIMIT, and then start a new one as an old one finishes?

Upvotes: 0

Views: 276

Answers (2)

Sinan Ünür

Reputation: 118118

If I understand the title of your question, you are looking for Parallel::ForkManager.

I do not understand why Parallel::ForkManager is not available. It is a pure Perl module.

use Parallel::ForkManager;

my $pm = Parallel::ForkManager->new($MAX_PROCESSES);

for my $file (@filesToProcess) {
  # Forks and returns the pid for the child:
  my $pid = $pm->start and next;

  # ... do some work with $file in the child process ...

  $pm->finish; # Terminates the child process
}

You can just copy the module's .pm file to a place your script can find. For example:

/some/custom/path/myscript
/some/custom/path/inc/Parallel/ForkManager.pm

Then, in myscript:

use FindBin qw( $RealBin );
use lib "$RealBin/inc";
use Parallel::ForkManager;

And, of course, if for some unfathomable reason you can't do that, you can always fatpack your script.

Upvotes: 3

Re Q1: Yes. Only the parent process will execute the return, as $pid will be zero in the child process.
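A minimal sketch of that behaviour, not taken from the question's script: fork returns 0 in the child and the child's PID in the parent, so the two processes take different branches. The exit status 7 is an arbitrary value chosen only so the parent has something to observe.

```perl
use strict;
use warnings;

my $pid = fork();
die "fork failed: $!" unless defined $pid;

if ( $pid == 0 ) {
    # Child: fork returned 0, so a "return if $pid" would not fire here
    # and the child carries on with its own work.
    exit 7;    # arbitrary status, only so the parent can observe it
}

# Parent: $pid holds the child's process ID.
my $reaped = waitpid( $pid, 0 );    # block until that child exits
my $status = $? >> 8;               # the child's exit code
```

In the question's backgroundProcess the parent returns immediately instead of waiting, which is why the children run in the background.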

Re Q2: I'm not sure I understand the question correctly, but open() is executed in the child process, so the file handles are local to that child.
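A rough illustration, assuming made-up sample files in a scratch directory: each child opens its own lexical handles after the fork, so no handle is shared between children. The file names and the upper-casing step are placeholders for the real processing.

```perl
use strict;
use warnings;
use File::Temp qw( tempdir );
use File::Path qw( remove_tree );

my $dir = tempdir();    # scratch directory, removed by hand at the end

# Create three small input files (made-up sample data).
my @inputs;
for my $n ( 1 .. 3 ) {
    my $path = "$dir/in$n.txt";
    open( my $fh, '>', $path ) or die "cannot create $path: $!";
    print {$fh} "hello $n\n";
    close($fh) or die "cannot close $path: $!";
    push @inputs, $path;
}

my @pids;
for my $file (@inputs) {
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;

    if ( $pid == 0 ) {
        # Child: these handles are opened after the fork, so they exist
        # only in this process.
        open( my $in,  '<', $file )       or die "failed read $file: $!";
        open( my $out, '>', "$file.out" ) or die "failed write $file.out: $!";
        while ( my $line = <$in> ) {
            print {$out} uc $line;    # placeholder "processing"
        }
        close($in);
        close($out) or die "failed close $file.out: $!";
        exit 0;
    }
    push @pids, $pid;    # parent: remember the child
}
waitpid( $_, 0 ) for @pids;

# Parent: read each child's output back, then clean up.
my %result;
for my $file (@inputs) {
    open( my $fh, '<', "$file.out" ) or die "missing $file.out: $!";
    local $/;    # slurp the whole file
    $result{$file} = <$fh>;
}
remove_tree($dir);
```

Lexical handles (my $in) are used here instead of barewords like readHandle; either works after a fork, but lexicals close automatically when they go out of scope.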

Re Q3: You'll have to keep track manually. Once the limit has been reached, call wait() to wait for one child to exit before starting a new child process. See http://perldoc.perl.org/functions/wait.html
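One way this could look, as a sketch only: the parent keeps a counter of live children and calls wait() whenever the counter reaches the limit. The value of $LIMIT, the file list, the empty process_file body, and the $peak counter are all assumptions for illustration.

```perl
use strict;
use warnings;

my $LIMIT          = 4;                                # assumed cap on concurrent children
my @filesToProcess = map { "file$_.txt" } 1 .. 10;    # placeholder file list

sub process_file {
    my ($file) = @_;
    # placeholder for the real per-file work
    return;
}

my $running = 0;    # children currently alive
my $peak    = 0;    # highest value $running reached (illustration only)

for my $file (@filesToProcess) {
    # At the limit: block until any one child exits, then free its slot.
    if ( $running >= $LIMIT ) {
        wait();
        $running--;
    }

    my $pid = fork();
    die "fork failed: $!" unless defined $pid;

    if ( $pid == 0 ) {    # child
        process_file($file);
        exit 0;
    }

    $running++;           # parent
    $peak = $running if $running > $peak;
}

# Reap the remaining children so none are left behind as zombies.
while ( wait() != -1 ) {
    $running--;
}
```

wait() returns -1 once there are no children left, which is what ends the final loop.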

Upvotes: 0
