Reinstate Monica Please

Reputation: 11603

Perl fork in recursive subroutine

I'm trying to write a recursive script that parses a large directory using forks for better performance. For a simple example, let's say I want to do DFS that runs no more than 10 concurrent forks, something like

#!/usr/bin/perl
use warnings;
use strict;
use Parallel::ForkManager;

my $pm = Parallel::ForkManager->new(10);
&proc_dir("/some/large/directory");
$pm->wait_all_children;

sub proc_dir {
  my $path = shift;
  my(@child_dirlist);

  if(opendir(DIR, $path)) {
    my @d = grep { $_ ne '.' && $_ ne '..' && -d "$path/$_" } readdir(DIR);
    @child_dirlist =  map { "$path/$_" } @d;
    closedir(DIR);
  }

  foreach my $d (@child_dirlist) {
    my $pid = $pm->start and next; #This will fail within child processes
    &proc_dir($d);
    $pm->finish;
  }
}

But Parallel::ForkManager requires that if you want to fork off more processes from a child process, you initialize another ForkManager, which kind of defeats the purpose of using one to begin with in this case. I've tried a few other modules and approaches, but haven't had any success both capping the number of forks at some threshold and getting them to work recursively. Wondering if anyone has solved a similar problem or knows a simple workaround.

Edit: Please assume I've tested this enough that CPU and I/O load isn't a concern for some small number of forks.

Upvotes: 1

Views: 518

Answers (1)

BergBrains

Reputation: 302

I don't think you're going to realize the benefits you're looking for with this approach. Forking recursively will simply spawn an unbounded number of processes until you overwhelm your machine. Your goal isn't to use a single instance of PFM; it's to expedite the processing.

To that end, I recommend that you look at the File::Find module. It's a Perl implementation of find-style directory traversal, and is most likely what you're looking for.

If I understand your sample code, you're simply looking for directories, so running find2perl will generate a wrapper script for File::Find:

find2perl /usr/share/emacs/ -type d

This will create the following script (pared down a bit):

#!/usr/bin/perl

use File::Find ();

# for the convenience of &wanted calls, including -eval statements:
use vars qw/*name *dir *prune/;
*name   = *File::Find::name;
*dir    = *File::Find::dir;
*prune  = *File::Find::prune;

# Traverse desired filesystems
File::Find::find({wanted => \&wanted}, '/usr/share/emacs/');
exit;

sub wanted {
    my ($dev,$ino,$mode,$nlink,$uid,$gid);

    (($dev,$ino,$mode,$nlink,$uid,$gid) = lstat($_)) &&
    -d _
    && print("$name\n");
}

In your wanted() subroutine, you can do whatever you want. File::Find has been shown to be faster than shelling out to find, and certainly so if you apply your per-file logic in your script rather than in subprocesses.
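For example, here's a minimal self-contained sketch of a wanted() callback that collects directory names into a list instead of printing them. It builds a throwaway temp tree so it runs anywhere; in practice you'd pass your real directory to find():

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Find ();
use File::Temp qw(tempdir);
use File::Path qw(make_path);

# Build a small throwaway tree so the example is self-contained.
my $root = tempdir(CLEANUP => 1);
make_path("$root/a/x", "$root/b");

# Collect directory names instead of printing them.
my @dirs;
File::Find::find({ wanted => sub {
    # find() chdirs into each directory by default, so -d $_ tests the
    # basename; $File::Find::name holds the full path.
    push @dirs, $File::Find::name if -d $_;
} }, $root);

print scalar(@dirs), " directories found\n";   # root, a, a/x, b => 4
```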

If you want to break the processing out across multiple processes, I recommend iterating over the top-level subdirectories using Parallel::ForkManager.
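A minimal sketch of that split, using a hypothetical parallel_walk() helper: the parent forks once per top-level subdirectory, capped by a single Parallel::ForkManager, and each child then recurses serially with File::Find, so no nested fork manager is needed. The demo tree at the bottom is just for illustration; point parallel_walk() at your real directory instead:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Find ();
use Parallel::ForkManager;

# Fork once per top-level subdirectory, at most $max children at a
# time; each child walks its whole subtree serially with File::Find.
sub parallel_walk {
    my ($root, $max) = @_;
    opendir(my $dh, $root) or die "Cannot open $root: $!";
    my @top = grep { $_ ne '.' && $_ ne '..' && -d "$root/$_" } readdir($dh);
    closedir($dh);

    my $pm = Parallel::ForkManager->new($max);
    for my $sub (@top) {
        $pm->start and next;             # parent: queue the next subtree
        File::Find::find(sub {           # child: serial DFS of one subtree
            print "$File::Find::name\n" if -d $_;
        }, "$root/$sub");
        $pm->finish;
    }
    $pm->wait_all_children;
}

# Demo on a throwaway tree (not cleaned up, to keep the fork handling simple).
use File::Temp qw(tempdir);
use File::Path qw(make_path);
my $demo = tempdir();
make_path("$demo/a/x", "$demo/b/y");
parallel_walk($demo, 2);
```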

By the way, I wouldn't worry about instantiating multiple PFM objects; that's the least of your worries. Recursively spawning up to 10 subprocesses per subprocess is much riskier.

One more thing: if you still want to pursue the recursive PFM approach, you might try running both implementations against one another in a Benchmark harness.
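A minimal Benchmark harness sketch using cmpthese(). The two subs here are trivial stand-ins; you'd replace their bodies with the real recursive-PFM scan and the File::Find scan of the same directory:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Stand-in subs: swap in the real recursive-ForkManager scan and the
# File::Find-based scan to compare them on your actual directory.
sub recursive_fork_scan { my $n = 0; $n += $_      for 1 .. 1_000; $n }
sub find_based_scan     { my $n = 0; $n += $_ * $_ for 1 .. 1_000; $n }

# Run each sub 10,000 times and print a rate-comparison table.
cmpthese(10_000, {
    recursive_fork => \&recursive_fork_scan,
    file_find      => \&find_based_scan,
});
```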

Upvotes: 3
