Reputation: 99
I have a Perl script I wish to parallelise.
It is composed of a while loop over more than 11,000 lines (File2) nested inside another while loop over 3,400 lines (File1), which makes it extremely slow.
open (FILE1, "File1.txt") or die "Can't open File1";
open (OUT, ">Outfile.txt");

while (<FILE1>) {
    my @data = split (/ /, $_);
    my $RS = 1;
    open (FILE2, "File2.txt") or die "Can't open File2";
    while (<FILE2>) {
        my @value = split (/ /, $_);
        if    ($data[$RS] == 1) { print OUT $value[1]; $RS++; }
        elsif ($data[$RS] == 2) { print OUT $value[2]; $RS++; }
        elsif ($data[$RS] == 0) { print OUT $value[3]; $RS++; }
    }
    close FILE2;
}
I'm looking for a way to do the equivalent of qsub with every line of File1 so I can send 3440 jobs. Any suggestions? I'd like to stay with Perl if possible. I tried to embed this code in a bash script, but I don't really understand how to embed one language inside another.
My File1 contains a list of IDs with information in columns. Each column corresponds to a single line in File2. I'd like to be able to run the second loop for multiple IDs simultaneously instead of one after another.
File1
ID RS_10 RS_15 RS_30
23 1 0 1
34 2 2 0
45 1 1 0
23 0 0 2
10 2 1 1
File2
RS_10 A B C
RS_15 D E F
RS_30 G H I
Upvotes: 2
Views: 269
Reputation: 57600
The first rule of optimization is not to do it too early (i.e. jumping to premature conclusions without profiling your code).
The second rule would probably refer to caching.
Your File2 isn't very large. I'd say we load it into memory. This has the following advantages:

1. We only have to parse each line once, instead of splitting it anew for every line of File1.
2. We don't have to re-open and re-read the file on every iteration of the outer loop.
3. We can transform the data into a more convenient form up front.
About that first point: You split each line over three thousand times. Those cycles could have been better spent.
About that third point: you seem to do an index conversion:
1 → 1, 2 → 2, 0 → 3
Instead of testing for all values with an if/elsif-switch (linear complexity), we could use an array that does this translation (constant time lookups):
my @conversion = (3, 1, 2);
...;
print OUT $value[$conversion[$data[$RS++]]];
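To see the translation at work, here is a quick standalone check (nothing in it beyond the values above):

my @conversion = (3, 1, 2);
for my $code (1, 2, 0) {
    # each input code maps directly to the File2 column it selects
    printf "%d -> column %d\n", $code, $conversion[$code];
}
# prints:
# 1 -> column 1
# 2 -> column 2
# 0 -> column 3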
If this index conversion is constant, we could do it once and only once when parsing File2. This would look like
use strict; use warnings;
use autodie; # automatic error handling

my @file2;
{
    open my $file2, "<", "File2.txt";
    while (<$file2>) {
        my (undef, @vals) = split;
        # do the reordering. This is equivalent to @vals = @vals[2, 0, 1];
        unshift @vals, pop @vals;
        push @file2, \@vals;
    }
}
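To convince yourself the reordering did the right thing, you could dump the structure (Data::Dumper is a core module):

use Data::Dumper;
print Dumper \@file2;
# for the sample File2 this holds, reformatted for brevity:
#   [ ['C', 'A', 'B'],   # RS_10, was (A, B, C)
#     ['F', 'D', 'E'],   # RS_15
#     ['I', 'G', 'H'] ]  # RS_30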
Now we can move on to iterating through File1. Printing the corresponding entry from File2 now looks like
open my $file1, "<", "File1.txt";
<$file1>; # remove header

while (<$file1>) {
    my ($id, @indices) = split;
    print $id, map $file2[$_][$indices[$_]], 0 .. $#indices;
    # but I guess you'd want some separator in between;
    # if so, set the $, variable
}
This algorithm is still quadratic (the map is just a for-loop in disguise), but it should have a better constant factor. The output of the above code, given your example input, is
23 A F G
34 B E I
45 A D I
23 C F H
10 B D G
(with $, = " "; $\ = "\n").
This last step (looping through File1) could be parallelized, but this is unlikely to help much: IO is slow, communication between threads is expensive (IPC even more so), and the output would be in random order. We could spawn a bunch of workers, and pass unparsed lines in a queue:
use threads;       # should be the first module to be loaded
use Thread::Queue;

use constant NUM_THREADS => 4; # number of cores

# parse the File2 data here

my $queue = Thread::Queue->new;
my @threads = map threads->new(\&worker), 1 .. NUM_THREADS;

# enqueue data
$queue->enqueue($_) while <$file1>;
# end the queue
$queue->enqueue((undef) x NUM_THREADS); # $queue->end in newer versions

# wait for threads to complete
$_->join for @threads;

sub worker {
    while (defined(my $line = $queue->dequeue)) {
        my ($id, @indices) = split ' ', $line;
        print $id, map $file2[$_][$indices[$_]], 0 .. $#indices;
    }
}
Note that this copies the @file2 into all threads. Fun fact: for the example data, this threaded solution takes roughly 4× as long. This is mostly the overhead of thread creation, so this will be less of an issue for your data.
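If copying the table into each thread ever became a real cost, the lookup table could be shared instead. A minimal sketch, assuming the @file2 from above (shared_clone is provided by threads::shared):

use threads;
use threads::shared;

# the shared clone must exist before the worker threads are spawned
my $file2_shared = shared_clone(\@file2);
# workers then index into it as $file2_shared->[$_][$indices[$_]]

Bear in mind that shared variables pay a locking cost on every access, so this only pays off for genuinely large tables.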
In any case, profile your code to see where you can optimize most effectively. I recommend the excellent Devel::NYTProf. E.g. for my non-threaded test run with this very limited data, the overhead implied by autodie and friends used more time than doing the actual processing. For you, the most expensive line would probably be
print $id, map $file2[$_][$indices[$_]], 0 .. $#indices;
but there isn't much we can do here inside Perl.
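For completeness, running the profiler usually amounts to no more than this (standard Devel::NYTProf usage):

perl -d:NYTProf script.pl   # writes nytprof.out
nytprofhtml                 # renders it as an HTML report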
Upvotes: 5