Sos

Reputation: 1949

Using Parallel::ForkManager to process file

I'm wondering whether it would be a good idea to use Parallel::ForkManager (or another parallelization tool) to process some files that I have. Basically, I am processing a very large file and outputting its contents into multiple files. This usually takes about 3 hours on a 64-core server.

What I am wondering is how this module's implementation gathers the data. For instance, if I do

use Parallel::ForkManager;
# Max 64 processes
my $pm = new Parallel::ForkManager(64);

open my $in, "<", 'D:\myfile.txt' or die "Cannot open input file: $!";
my @data = <$in>;
close $in;

# gather unique data headers
my @uniqueheaders;
foreach my $line (@data) {
    my @split = split "\t", $line;
    push @uniqueheaders, $split[0] unless grep { $_ =~ /$split[0]/ } @uniqueheaders;
}

foreach my $head (@uniqueheaders) {
    $pm->start and next; # do the fork

    my @matches = grep { $_ =~ /^$head\t/ } @data; # find all lines in @data that start with $head
    if ($#matches > 1) { # print out if matches are found
        open my $out, ">", 'D:\directory\\' . $head . 'data'
            or die "Cannot open output file for $head: $!";
        print $out @matches;
        close $out;
    }
    else { print "Problem in $head!\n"; }

    $pm->finish; # do the exit in the child process
}
$pm->wait_all_children;

Now, my questions are:

  1. Do you see any problem with making the script like this? Would each $head be allocated to one core at a time, or is there something else I should watch out for that I'm unaware of?
  2. What if I wanted to process the whole data and output it all at once? For instance, I could create an array @gatherstuff before the last foreach loop and, instead of printing out, push @gatherstuff, @matches; (sketched below). Is this as simple as I'm making it?
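
To sketch what I mean in question 2 (I am not sure whether the children's pushes would even be visible in the parent, which is part of what I am asking):

my @gatherstuff;
foreach my $head (@uniqueheaders) {
    $pm->start and next; # do the fork

    my @matches = grep { $_ =~ /^$head\t/ } @data;
    push @gatherstuff, @matches; # gather instead of writing one file per $head

    $pm->finish; # do the exit in the child process
}
$pm->wait_all_children;
print @gatherstuff; # print everything once, at the end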

Upvotes: 2

Views: 1148

Answers (2)

svsd

Reputation: 1869

Before you try making the code run in parallel, try and see if you can optimise your code to run efficiently in serial. If the benefit of this optimisation is not enough, then you can try using Parallel::ForkManager. Some of the issues with your code are:

  1. The whole file is read into memory: Reading such a large number of lines at once not only increases your program's memory usage considerably, it can also increase the time it takes to execute. Memory may not be a concern, but the repeated reallocations of the @data array do take time, and if you are short on RAM you will end up swapping to disk, which is far more time consuming.
  2. grep is used instead of a hash for 'contains' checks: grepping over such a large number of records multiple times is incredibly slow and does not scale at all. As it stands, extracting the headers is O(n^2), where n is the number of records in the input file. With a hash the check becomes O(n), which is much more manageable (see the sketch after this list). A similar argument applies to the way you're extracting the matching records.
  3. The 'headers' are extracted up front: This may be necessary for your current approach to running the code in parallel, but try to avoid it, since it requires a full pass over all the records.
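
For example, a hash-based version of the header extraction could look roughly like this (keeping your variable names; %seen is just an illustrative name):

my %seen;
my @uniqueheaders;
foreach my $line (@data) {
    my ($head) = split "\t", $line, 2;                # only the first field is needed
    push @uniqueheaders, $head unless $seen{$head}++; # O(1) 'contains' check
}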

That said, here is how I would solve the whole thing without making the code run in parallel. You may need to increase the allowed number of open file descriptors using the ulimit -n command.

use strict;
use warnings;

my ($input_file, $output_dir) = (@ARGV);

die "Syntax: $0 <input_file> <output_dir>"
    unless $input_file and $output_dir;

open my $in, '<', $input_file
    or die "Could not open input file $input_file: $!";

# map of ID (aka header) -> file handle
my %idfh;

while (my $line = <$in>) {
    # extract the ID; skip any line that does not have the expected "ID<tab>..." format
    next unless $line =~ /^(.+?)\t/;

    my $id = $1;
    # get the open file handle
    my $fh = $idfh{$id};

    unless ($fh) {
        # if there was no file handle for this ID, open a new one
        open $fh, '>', "$output_dir/${id}data"
            or die "Could not open file for ID $id: $!";

        $idfh{$id} = $fh;
    }

    # print the record to the correct file handle
    print $fh $line;
}

# perl automatically closes all file handles

This is pretty simple:

  1. Iterate over each line of the file. For each iteration, do the following:
  2. Extract the ID.
  3. If we have not seen the ID before, open the file corresponding to the ID for writing. Otherwise, go to step 5.
  4. Store the file handle in a map with the ID as the key.
  5. If the ID was seen earlier, get the file handle from the hash.
  6. Write the record through the file handle.
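
For reference, the script takes the input file and the output directory as command-line arguments (per the Syntax message above), and it creates one output file per ID in the output directory, named ${id}data as in your original code.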

Upvotes: 2

Sinan Ünür

Reputation: 118118

Using Parallel::ForkManager with a single input file may end up making sense only if you preprocess the file to determine ranges to allocate to each worker. And that only makes sense if you are going to repeat the work multiple times with the same input.
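
As a rough illustration of the kind of preprocessing I mean (just a sketch; the file name and worker count are made up, and each boundary is snapped forward to the next newline so no record is split between workers):

use strict;
use warnings;

my ($file, $workers) = ('input.txt', 8); # illustrative values
my $size  = -s $file;
my $chunk = int($size / $workers);

open my $fh, '<', $file or die "Cannot open $file: $!";
my @ranges;
my $start = 0;
for my $i (1 .. $workers) {
    my $end = $i == $workers ? $size : $i * $chunk;
    if ($end < $size) {
        seek $fh, $end, 0; # jump to the tentative boundary
        <$fh>;             # consume the rest of the current line
        $end = tell $fh;
    }
    push @ranges, [ $start, $end ];
    $start = $end;
}
close $fh;

Each forked worker would then seek to its own start offset and process only the lines that fall within its range.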

Even if you might gain something from using Parallel::ForkManager, having 30 processes trying to do IO is not going to get you anything. The most I would recommend is twice the number of cores, and only if the system is not otherwise busy and you have a lot of memory.

The operating system's caching may result in different processes actually reading the file from memory after the initial warm up, and lead to gains from having multiple processes do the processing.

The writes are much less likely to benefit from multiple processes, for several reasons: the processes will be reading from all over the memory space, they will have to wait for buffers to be flushed to disk, and so on. In this case, the IO bottleneck will definitely be more prominent.

Upvotes: 3
