Reputation: 1949
I'm wondering whether it would be a good idea to use Parallel::ForkManager
(or another parallelization tool) to process some files that I have. Basically, I am processing a very large file and outputting its contents into multiple files. This usually takes ~3 hours on a 64-core server.
What I am wondering is how the implementation of this module gathers the data. For instance, if I do
use Parallel::ForkManager;

# Max 30 processes
my $pm = new Parallel::ForkManager(30);

open my $in, "<", 'D:\myfile.txt';
my @data = <$in>;
close $in;

# gather unique data headers
my @uniqueheaders;
foreach my $line (@data) {
    my @split = split "\t", $line;
    push @uniqueheaders, $split[0] unless (grep { $_ =~ /$split[0]/ } @uniqueheaders);
}

foreach my $head (@uniqueheaders) {
    $pm->start and next; # do the fork

    my @matches = grep { $_ =~ /^$head\t/ } @data; # find all lines in @data that start with $head
    if ($#matches > 1) { # print them out if matches are found
        open my $out, ">", 'D:\directory\\' . $head . 'data';
        print $out @matches;
        close $out;
    }
    else {
        print "Problem in $head!\n";
    }

    $pm->finish; # do the exit in the child process
}
$pm->wait_all_children;
Now, my questions are:
1. Would each $head be allocated to one core at a time, or would I have to watch for something else that I'm unaware of?
2. Could I instead declare @gatherstuff before the last foreach loop, where instead of printing out, it would push @gatherstuff, @matches;? Is this as simple as I'm making it?
Upvotes: 2
Views: 1148
Reputation: 1869
Before you try making the code run in parallel, see whether you can optimise your code to run efficiently in serial. If the benefit of that optimisation is not enough, then you can try using Parallel::ForkManager. Some of the issues with your code are:
- Slurping the whole input file into the @data array takes time and memory. If the amount of RAM is low, you'll have a lot of swapping to disk, which is a lot more time consuming.
- grep is used instead of a hash for 'contains' checks: grepping over such a large number of records multiple times is incredibly slow and not scalable at all. As it stands, the process of extracting headers is of the order O(n^2), where n is the number of records in the input file. If you use a hash, the order will be O(n), which is much more manageable; a minimal hash-based sketch appears after the solution below. A similar argument applies to the way you're extracting the matching records.

This is the way I would solve it, without making the code run in parallel. You may need to increase the number of open file descriptors allowed, using the ulimit -n command.
use strict;
use warnings;

my ($input_file, $output_dir) = @ARGV;
die "Syntax: $0 <input_file> <output_dir>"
    unless $input_file and $output_dir;

open my $in, '<', $input_file
    or die "Could not open input file $input_file: $!";

# map of ID (aka header) -> file handle
my %idfh;

while (my $line = <$in>) {
    # extract the ID
    $line =~ /^(.+?)\t/;
    my $id = $1;

    # get the open file handle
    my $fh = $idfh{$id};

    unless ($fh) {
        # if there was no file handle for this ID, open a new one
        open $fh, '>', "$output_dir/${id}data"
            or die "Could not open file for ID $id: $!";
        $idfh{$id} = $fh;
    }

    # print the record to the correct file handle
    print $fh $line;
}

# perl automatically closes all file handles
The approach is pretty simple: read the input line by line, extract the ID from each line, and keep one output file handle per ID cached in a hash.
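For what it's worth, here is a minimal sketch of the hash-based 'contains' check mentioned above, applied to the header-gathering loop from the question. The variable names (@data, @uniqueheaders) follow the question's code; reading @data from the diamond operator is an assumption made only to keep the example self-contained.

use strict;
use warnings;

# Sketch: collect unique headers in O(n) with a hash,
# instead of grep'ing the growing @uniqueheaders array for every line.
my @data = <>;                             # stands in for the question's slurped @data
my %seen;
my @uniqueheaders;
foreach my $line (@data) {
    my ($header) = split /\t/, $line, 2;   # first tab-separated field
    push @uniqueheaders, $header unless $seen{$header}++;
}

Each header is looked up in %seen in constant time, so the whole pass over @data stays linear in the number of records.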
Upvotes: 2
Reputation: 118118
Using Parallel::ForkManager with a single input file may end up making sense only if you preprocess the file to determine ranges to allocate to each worker. And, that only makes sense if you are going to repeat the work multiple times with the same input.
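A rough sketch of that preprocessing idea follows (this is an illustration under assumptions, not code from the question: the file name, the worker count, the process_line stub, and the chunking scheme are all made up). The parent scans for byte offsets that fall on line boundaries, then each forked child seeks to its own range and handles only those lines.

use strict;
use warnings;
use Parallel::ForkManager;

my ($input_file, $n_workers) = ('myfile.txt', 8);   # assumed name and worker count

sub process_line { }    # stand-in for the real per-record work

my $size  = -s $input_file or die "Cannot stat $input_file";
my $chunk = int($size / $n_workers) + 1;

# Find range boundaries that coincide with line starts.
open my $in, '<', $input_file or die "Cannot open $input_file: $!";
my @starts = (0);
for my $i (1 .. $n_workers - 1) {
    seek $in, $i * $chunk, 0;   # jump near the ideal split point
    <$in>;                      # discard the partial line
    push @starts, tell $in;
}
close $in;
push @starts, $size;            # sentinel: end of file

my $pm = Parallel::ForkManager->new($n_workers);
for my $i (0 .. $n_workers - 1) {
    $pm->start and next;        # parent continues; the child runs the block below

    my ($from, $to) = ($starts[$i], $starts[$i + 1]);
    open my $fh, '<', $input_file or die "Cannot open $input_file: $!";
    seek $fh, $from, 0;
    while (tell($fh) < $to and defined(my $line = <$fh>)) {
        process_line($line);
    }
    close $fh;

    $pm->finish;                # child exits
}
$pm->wait_all_children;

If results have to come back to the parent instead of going straight to per-worker output files, note that a child's push onto a shared array only changes the child's own copy; Parallel::ForkManager's run_on_finish callback, together with a data reference passed to finish, is the usual way to hand data back to the parent.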
Even if you might gain something from using Parallel::ForkManager, having 30 processes trying to do IO is not going to get you anything. The most I would recommend is twice the number of cores, if the system is not doing anything else and assuming you have a lot of memory.
The operating system's caching may result in different processes actually reading the file from memory after the initial warm up, and lead to gains from having multiple processes do the processing.
The writes are much less likely to benefit from having multiple processes, for many reasons: the processes will be reading from all over the memory space, they will have to wait for buffers to be flushed to disk, etc. In this case, the IO bottleneck will definitely be more prominent.
Upvotes: 3