Sos

Reputation: 1949

Using Parallel::ForkManager to process file

I'm wondering whether it would be a good idea to use Parallel::ForkManager (or another parallelization tool) to process some files that I have. Basically, I am processing a very large file and outputting its contents into multiple files. This usually takes about 3 hours on a 64-core server.

What I am wondering is how this module's implementation gathers the data. For instance, if I do

use Parallel::ForkManager;
# Max 64 processes
my $pm = new Parallel::ForkManager(64);

open my $in, "<", 'D:\myfile.txt' or die "Cannot open input file: $!";
my @data = <$in>;
close $in;

# gather unique data headers
my @uniqueheaders;
foreach my $line (@data) {
    my @split = split "\t", $line;
    push @uniqueheaders, $split[0] unless grep { $_ =~ /$split[0]/ } @uniqueheaders;
}

foreach my $head (@uniqueheaders) {
    $pm->start and next; # do the fork

    my @matches = grep { $_ =~ /^$head\t/ } @data; # find all lines in @data that start with $head
    if ($#matches > 1) { # print out if matches are found
        open my $out, ">", 'D:\directory\\' . $head . 'data'
            or die "Cannot open output file for $head: $!";
        print $out @matches;
        close $out;
    }
    else { print "Problem in $head!\n"; }

    $pm->finish; # do the exit in the child process
}
$pm->wait_all_children;

Now, my questions are:

  1. Do you see any problem with making the script like this? Would each $head be allocated to one core at a time, or is there something else I should watch out for that I'm unaware of?
  2. What if I wanted to process the whole data and output it all at once? For instance, I could create an array @gatherstuff before the last foreach loop and, instead of printing out, push @gatherstuff, @matches; (sketched below). Is this as simple as I'm making it?
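
To sketch what I mean in question 2 (I am not sure whether the children's pushes would even be visible in the parent, which is part of what I am asking):

my @gatherstuff;
foreach my $head (@uniqueheaders) {
    $pm->start and next; # do the fork

    my @matches = grep { $_ =~ /^$head\t/ } @data;
    push @gatherstuff, @matches; # gather instead of writing one file per $head

    $pm->finish; # do the exit in the child process
}
$pm->wait_all_children;
print @gatherstuff; # print everything once, at the end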

Upvotes: 2

Views: 1148

Answers (2)

svsd

Reputation: 1869

Before you try making the code run in parallel, try and see if you can optimise your code to run efficiently in serial. If the benefit of this optimisation is not enough, then you can try using Parallel::ForkManager. Some of the issues with your code are:

  1. The whole file is read into memory: Reading such a large number of lines at once not only increases your program's memory usage considerably, it can also increase the time it takes to execute. Memory may not be a concern, but the repeated reallocations of the @data array do take time, and if you are short on RAM you will end up swapping to disk, which is far more time consuming.
  2. grep is used instead of a hash for 'contains' checks: grepping over such a large number of records multiple times is incredibly slow and does not scale at all. As it stands, extracting the headers is O(n^2), where n is the number of records in the input file. With a hash the check becomes O(n), which is much more manageable (see the sketch after this list). A similar argument applies to the way you're extracting the matching records.
  3. The 'headers' are extracted up front: This may be necessary for your current approach to running the code in parallel, but try to avoid it, since it requires a full pass over all the records.
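
For example, a hash-based version of the header extraction could look roughly like this (keeping your variable names; %seen is just an illustrative name):

my %seen;
my @uniqueheaders;
foreach my $line (@data) {
    my ($head) = split "\t", $line, 2;                # only the first field is needed
    push @uniqueheaders, $head unless $seen{$head}++; # O(1) 'contains' check
}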

That said, here is how I would solve the whole thing without making the code run in parallel. You may need to increase the allowed number of open file descriptors using the ulimit -n command.

use strict;
use warnings;

my ($input_file, $output_dir) = (@ARGV);

die "Syntax: $0 <input_file> <output_dir>"
    unless $input_file and $output_dir;

open my $in, '<', $input_file
    or die "Could not open input file $input_file: $!";

# map of ID (aka header) -> file handle
my %idfh;

while (my $line = <$in>) {
    # extract the ID; skip any line that does not have the expected "ID<tab>..." format
    next unless $line =~ /^(.+?)\t/;

    my $id = $1;
    # get the open file handle
    my $fh = $idfh{$id};

    unless ($fh) {
        # if there was no file handle for this ID, open a new one
        open $fh, '>', "$output_dir/${id}data"
            or die "Could not open file for ID $id: $!";

        $idfh{$id} = $fh;
    }

    # print the record to the correct file handle
    print $fh $line;
}

# perl automatically closes all file handles

This is pretty simple:

  1. Iterate over each line of the file. For each iteration, do the following:
  2. Extract the ID.
  3. If we have not seen the ID before, open the file corresponding to the ID for writing. Otherwise, go to step 5.
  4. Store the file handle in a map with the ID as the key.
  5. If the ID was seen earlier, get the file handle from the hash.
  6. Write the record through the file handle.
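
For reference, the script takes the input file and the output directory as command-line arguments (per the Syntax message above), and it creates one output file per ID in the output directory, named ${id}data as in your original code.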

Upvotes: 2

Sinan Ünür

Reputation: 118118

Using Parallel::ForkManager with a single input file may end up making sense only if you preprocess the file to determine ranges to allocate to each worker. And that only makes sense if you are going to repeat the work multiple times with the same input.
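
As a rough illustration of the kind of preprocessing I mean (just a sketch; the file name and worker count are made up, and each boundary is snapped forward to the next newline so no record is split between workers):

use strict;
use warnings;

my ($file, $workers) = ('input.txt', 8); # illustrative values
my $size  = -s $file;
my $chunk = int($size / $workers);

open my $fh, '<', $file or die "Cannot open $file: $!";
my @ranges;
my $start = 0;
for my $i (1 .. $workers) {
    my $end = $i == $workers ? $size : $i * $chunk;
    if ($end < $size) {
        seek $fh, $end, 0; # jump to the tentative boundary
        <$fh>;             # consume the rest of the current line
        $end = tell $fh;
    }
    push @ranges, [ $start, $end ];
    $start = $end;
}
close $fh;

Each forked worker would then seek to its own start offset and process only the lines that fall within its range.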

Even if you might gain something from using Parallel::ForkManager, having 30 processes trying to do IO is not going to get you anything. The most I would recommend is twice the number of cores, and only if the system is not otherwise busy and you have a lot of memory.

The operating system's caching may result in different processes actually reading the file from memory after the initial warm up, and lead to gains from having multiple processes do the processing.

The writes are much less likely to benefit from multiple processes, for several reasons: the processes will be reading from all over the memory space, they will have to wait for buffers to be flushed to disk, and so on. In this case, the IO bottleneck will definitely be more prominent.

Upvotes: 3
