Reputation: 99
I have a Perl script I wish to parallelise.
It is composed of a while loop over more than 11,000 lines (File2) nested inside another while loop over 3,400 lines (File1), which makes it extremely slow.
open (FILE1, "File1.txt") or die "Can't open File1";
open (OUT, ">Outfile.txt");

while (<FILE1>) {
    my @data = split (/ /, $_);
    my $RS = 1;
    open (FILE2, "File2.txt") or die "Can't open File2";
    while (<FILE2>) {
        my @value = split (/ /, $_);
        if    ($data[$RS] == 1) { print OUT $value[1]; $RS++; }
        elsif ($data[$RS] == 2) { print OUT $value[2]; $RS++; }
        elsif ($data[$RS] == 0) { print OUT $value[3]; $RS++; }
    }
    close FILE2;
}
I'm looking for a way to do the equivalent of qsub with every line of File1 so I can send 3440 jobs. Any suggestions? I'd like to stay with Perl if possible. I tried to embed this code in a bash script, but I don't really understand how to embed one language inside another.
My File1 contains a list of IDs with information in columns. Each column corresponds to a single line in File2. I'd like to be able to run the second loop for multiple IDs simultaneously instead of one after another.
File1
ID RS_10 RS_15 RS_30
23 1 0 1
34 2 2 0
45 1 1 0
23 0 0 2
10 2 1 1
File2
RS_10 A B C
RS_15 D E F
RS_30 G H I
Upvotes: 2
Views: 269
Reputation: 57600
The first rule of optimization is not to do it too early (i.e. jumping to premature conclusions without profiling your code).
The second rule would probably refer to caching.
Your File2 isn't very large. I'd say we load it into memory. This has the following advantages:

1. We only have to parse each line once, instead of splitting it anew for every line of File1.
2. We don't have to re-open and re-read the file on every iteration of the outer loop.
3. We can transform the data into a more convenient form up front.
About that first point: You split each line over three thousand times. Those cycles could have been better spent.
About that third point: you seem to do an index conversion:
1 → 1, 2 → 2, 0 → 3
Instead of testing for all values with an if/elsif-switch (linear complexity), we could use an array that does this translation (constant time lookups):
my @conversion = (3, 1, 2);
...;
print OUT $value[$conversion[$data[$RS++]]];
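To see the translation at work, here is a quick standalone check (nothing in it beyond the values above):

my @conversion = (3, 1, 2);
for my $code (1, 2, 0) {
    # each input code maps directly to the File2 column it selects
    printf "%d -> column %d\n", $code, $conversion[$code];
}
# prints:
# 1 -> column 1
# 2 -> column 2
# 0 -> column 3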
If this index conversion is constant, we could do it once and only once when parsing File2. This would look like
use strict; use warnings;
use autodie; # automatic error handling

my @file2;
{
    open my $file2, "<", "File2.txt";
    while (<$file2>) {
        my (undef, @vals) = split;
        # do the reordering. This is equivalent to @vals = @vals[2, 0, 1];
        unshift @vals, pop @vals;
        push @file2, \@vals;
    }
}
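To convince yourself the reordering did the right thing, you could dump the structure (Data::Dumper is a core module):

use Data::Dumper;
print Dumper \@file2;
# for the sample File2 this holds, reformatted for brevity:
#   [ ['C', 'A', 'B'],   # RS_10, was (A, B, C)
#     ['F', 'D', 'E'],   # RS_15
#     ['I', 'G', 'H'] ]  # RS_30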
Now we can move on to iterating through File1. Printing the corresponding entry from File2 now looks like
open my $file1, "<", "File1.txt";
<$file1>; # remove header

while (<$file1>) {
    my ($id, @indices) = split;
    print $id, map $file2[$_][$indices[$_]], 0 .. $#indices;
    # but I guess you'd want some separator in between;
    # if so, set the $, variable
}
This algorithm is still quadratic (the map is just a for-loop in disguise), but it should have a better constant factor. The output of the above code, given your example input, is
23 A F G
34 B E I
45 A D I
23 C F H
10 B D G
(with $, = " "; $\ = "\n").
This last step (looping through File1) could be parallelized, but this is unlikely to help much: IO is slow, communication between threads is expensive (IPC even more so), and the output would be in random order. We could spawn a bunch of workers, and pass unparsed lines in a queue:
use threads;       # should be the first module to be loaded
use Thread::Queue;

use constant NUM_THREADS => 4; # number of cores

# parse the File2 data here

my $queue = Thread::Queue->new;
my @threads = map threads->new(\&worker), 1 .. NUM_THREADS;

# enqueue data
$queue->enqueue($_) while <$file1>;
# end the queue
$queue->enqueue((undef) x NUM_THREADS); # $queue->end in newer versions

# wait for threads to complete
$_->join for @threads;

sub worker {
    while (defined(my $line = $queue->dequeue)) {
        my ($id, @indices) = split ' ', $line;
        print $id, map $file2[$_][$indices[$_]], 0 .. $#indices;
    }
}
Note that this copies the @file2 into all threads. Fun fact: for the example data, this threaded solution takes roughly 4× as long. This is mostly the overhead of thread creation, so this will be less of an issue for your data.
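If copying the table into each thread ever became a real cost, the lookup table could be shared instead. A minimal sketch, assuming the @file2 from above (shared_clone is provided by threads::shared):

use threads;
use threads::shared;

# the shared clone must exist before the worker threads are spawned
my $file2_shared = shared_clone(\@file2);
# workers then index into it as $file2_shared->[$_][$indices[$_]]

Bear in mind that shared variables pay a locking cost on every access, so this only pays off for genuinely large tables.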
In any case, profile your code to see where you can optimize most effectively. I recommend the excellent Devel::NYTProf. E.g. for my non-threaded test run with this very limited data, the overhead implied by autodie and friends used more time than doing the actual processing. For you, the most expensive line would probably be
print $id, map $file2[$_][$indices[$_]], 0 .. $#indices;
but there isn't much we can do here inside Perl.
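For completeness, running the profiler usually amounts to no more than this (standard Devel::NYTProf usage):

perl -d:NYTProf script.pl   # writes nytprof.out
nytprofhtml                 # renders it as an HTML report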
Upvotes: 5