user001

Reputation: 1848

Forking in Perl when a common variable must be written to by the child processes

I am processing large text files and wish to benefit from parallel processing. I intend to break the file into as many sub-files as there are cores on the system, and to read each sub-file in a separate, forked process.

However, a problem with this approach is the requirement to write the input data to a common array. My rudimentary understanding of forks is that such an operation is not feasible and that the data must be written to separate arrays. Is this true?

If so, would the best approach be to simply wait for all child processes to finish and then concatenate the arrays into a single structure for subsequent processing? I have composed a minimal example to illustrate my situation:

#!/usr/bin/perl
use strict;
use warnings;
use Parallel::ForkManager;

my $input  = shift @ARGV;               # path to the input text file
my $nlines = `wc -l < $input`;          # line count only, no trailing filename
chomp($nlines);
my $ncores = `nproc`;
chomp($ncores);
system("split -n $ncores $input out_"); # sub-files are named out_aa, out_ab, etc.
my $pm = Parallel::ForkManager->new($ncores);

# declare and initialize (grep -n line numbers are 1-based)
our @data;
$data[$_] = 0 for 0 .. $nlines;

my $string = "string_literal";          # basis for filtering input data
my @files  = <out_*>;
foreach my $file (@files) {
  my $pid = $pm->start and next;        # parent: move on to the next file
  open(my $FH, "-|", "grep -n $string $file")
    or die "Cannot run grep on $file: $!";
  while (<$FH>) {
    chomp;
    my ($lineNumber, $datum) = split(/:/, $_, 2);
    $data[$lineNumber] = $datum;
  }
  close($FH);
  $pm->finish;                          # child exits here
}
$pm->wait_all_children;
## concatenate arrays here? ##

As one can see from this code snippet, I create an array (@data) whose length equals the number of lines in the input text file and initialize each element to zero. I then read the text file (filtered by grep -n) and place the datum from each line in the correspondingly numbered element of the array; thus, I end up with an array containing a value for every grep-matched line and zeroes for lines that fail to match.

Assuming I split the file into sub-files for parallel processing as outlined above, is the best approach to have each process write to its own sub-array and then concatenate the sub-arrays after wait_all_children returns?

If so, I suppose the arrays should be lexically scoped to the loop itself to prevent any issues with the parallel processes attempting to write to the same block of memory. Also, if post-loop concatenation is recommended, how are the arrays from the forked processes referenced?

Each has the same name (@data) and will probably be local to the forked process, so it is unclear to me how they are to be addressed. Can the data instead be written to some global array that is accessible to all the child processes (this consideration prompted the use of our in the declaration of the array in the example above)?

Perhaps this is an example of an XY problem, and an entirely different approach (e.g., threading) is recommended in this situation? Is the initial subdivision of the text file with the split command-line utility a reasonable way to deliver the data piecewise to parallel processes, or is another (e.g., Perl-intrinsic) approach to parceling the data advisable? Thanks.

Upvotes: 1

Views: 297

Answers (1)

Colin Phipps

Reputation: 908

When you fork, each process has its own memory and variables are not shared. So there is no concern over variable scoping, but the parent cannot directly see any of the data in variables in the child processes.
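
A quick, self-contained demonstration of this (not taken from your code, just a bare fork to show the behaviour):

use strict;
use warnings;

our @data = (0);
my $pid = fork() // die "fork failed: $!";
if ($pid == 0) {
    # Child: this modifies only the child's own copy of @data
    $data[0] = 42;
    exit 0;
}
waitpid($pid, 0);
print "parent still sees: $data[0]\n";   # prints 0, not 42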

But Parallel::ForkManager helpfully provides a facility for passing a data structure from each child back to the parent process when it finishes, so I suggest that you try that.
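
For illustration, here is a minimal sketch of that mechanism adapted to the loop in your question. It is one possible shape, not the only one: each child collects its grep hits in a private hash and passes a reference to it as the second argument of finish(); the parent merges each returned hash into @data in a run_on_finish callback. The %results hash and the merge step are my own additions for the example.

#!/usr/bin/perl
use strict;
use warnings;
use Parallel::ForkManager;

my $ncores = `nproc`;
chomp($ncores);
my $string = "string_literal";          # filter pattern, as in the question
our @data;                              # the merged results live only in the parent

my $pm = Parallel::ForkManager->new($ncores);

# Runs in the parent each time a child finishes; the last argument is the
# data structure that the child passed to finish().
$pm->run_on_finish(sub {
    my ($pid, $exit_code, $ident, $exit_signal, $core_dump, $child_data) = @_;
    return unless defined $child_data;  # a child may send nothing back
    while (my ($lineNumber, $datum) = each %$child_data) {
        $data[$lineNumber] = $datum;
    }
});

foreach my $file (<out_*>) {
    $pm->start and next;                # parent continues with the next file

    my %results;                        # private to this child process
    open(my $FH, "-|", "grep -n $string $file")
        or die "Cannot run grep on $file: $!";
    while (<$FH>) {
        chomp;
        my ($lineNumber, $datum) = split /:/, $_, 2;
        $results{$lineNumber} = $datum;
    }
    close($FH);

    # The second argument is a reference that Parallel::ForkManager serializes
    # and delivers to the parent's run_on_finish callback.
    $pm->finish(0, \%results);
}
$pm->wait_all_children;

Bear in mind that the returned structure is serialized through temporary files behind the scenes, so send back one reasonably sized structure per child rather than many small ones. Also note that grep -n numbers lines within each sub-file, so if the indices are meant to refer to the original file you will still need to add each sub-file's starting offset.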

Upvotes: 1
