user2383142

Reputation: 1

Perl fast matrix multiply

I have implemented the following statistical computation in Perl: http://en.wikipedia.org/wiki/Fisher_information.

The results are correct; I know this because I have hundreds of test cases that match input to output. The problem is that the function is called many times on every run of the script — about 530 calls on average. I found this out with Devel::NYTProf, which also showed me where the slow parts are. I have already optimized the algorithm to traverse only the top half of the matrix and reflect it onto the bottom half, since they are the same. I'm not a Perl expert, so I need to know whether there is anything else I can try to speed up the Perl. The script is distributed to clients, so compiling a C file is not an option. Is there another Perl library I can try? This needs to run in under a second if possible.

For more context: $MatrixRef is a reference to a matrix of floating-point numbers with $rows rows and $variables columns. Here is the NYTProf dump for the function.

#-----------------------------------------------
#
#-----------------------------------------------
sub ComputeXpX
# spent 4.27s within ComputeXpX which was called 526 times, avg 8.13ms/call:
# 526 times (4.27s+0s) by ComputeEfficiency at line 7121, avg 8.13ms/call
{
526 0s                  my ($MatrixRef, $rows, $variables) = @_;

526 0s                  my $r = 0;
526 0s                  my $c = 0;
526 0s                  my $k = 0;
526 0s                  my $sum = 0;
526 0s                  my @xpx = ();

526 11.0ms              for ($r = 0; $r < $variables; $r++)
                        {
14202   19.0ms                my @temp = (0) x $variables;
14202   6.01ms                push(@xpx, \@temp);
526 0s                  }
526 7.01ms              for ($r = 0; $r < $variables; $r++)
                        {
14202   144ms                   for ($c = $r; $c < $variables; $c++)
                                {
198828  43.0ms                                  $sum = 0;
                                        #for ($k = 0; $k < $rows; $k++)
198828  101ms                           foreach my $RowRef (@{$MatrixRef})
                                        {
                                                #$sum += $MatrixRef->[$k]->[$r]*$MatrixRef->[$k]->[$c];
6362496 3.77s                                   $sum += $RowRef->[$r]*$RowRef->[$c];
                                        }

198828  80.1ms                                  $xpx[$r]->[$c] = $sum;
                                                #reflect on other side of matrix
198828  82.1ms                                  $xpx[$c]->[$r] = $sum if ($r != $c);
14202   1.00ms                          }
526 2.00ms                  }

526 2.00ms                  return \@xpx;
}

Upvotes: 0

Views: 1090

Answers (2)

amon

Reputation: 57640

There really isn't much you can do here without rewriting parts in C, or moving to a better framework for mathematical operations than bare-bones Perl (→ PDL!).
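For a sense of what the PDL route looks like, here is a sketch of the same X'X (Gram matrix) computation done with PDL's vectorized matrix multiply. This assumes PDL is installed, and `ComputeXpX_pdl` is a hypothetical drop-in name, not from the question:

```perl
use strict;
use warnings;
use PDL;

# Compute X'X with PDL; the loops run in C inside PDL instead of in Perl.
# $MatrixRef is the same array-of-arrayrefs the pure-Perl version uses.
sub ComputeXpX_pdl {
    my ($MatrixRef) = @_;
    my $X   = pdl($MatrixRef);        # $rows x $variables piddle
    my $xpx = $X->transpose x $X;     # 'x' is PDL's matrix multiplication
    return $xpx->unpdl;               # back to nested Perl arrayrefs
}

# Example: a 3x2 input matrix gives a 2x2 result
my $xpx = ComputeXpX_pdl([ [1, 2], [3, 4], [5, 6] ]);
```

For this input the result is [[35, 44], [44, 56]], matching what the pure-Perl loop produces; the symmetry comes for free.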

Some minor optimization ideas:

  • You initialize @xpx with arrayrefs full of zeros. This is unnecessary, as you assign a value to every position anyway. If you want to pre-allocate array space, assign to the $#array value:

    my @array;
    $#array = 100; # preallocate space for 101 scalars
    

    This isn't generally useful, but you can benchmark with and without.

  • Iterate over ranges; don't use C-style for loops:

    for my $c ($r .. $variables - 1) { ... }
    

    Perl scalars aren't very fast for math operations, so offloading the range iteration to lower levels will gain a speedup.

  • Experiment with changing the order of the loops, and toy around with caching a level of array accesses. Keeping my $xpx_r = $xpx[$r] around in a scalar reduces the number of array accesses. If your input is large enough, this translates into a speed gain. Note that this only works because the cached value is a reference.
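Putting those ideas together, a pure-Perl variant might look like this (a sketch combining the suggestions above; `ComputeXpX_opt` is my name for it, and it is meant to behave identically to the original):

```perl
use strict;
use warnings;

# Same X'X computation: range-based loops, no zero-fill, cached row ref.
sub ComputeXpX_opt {
    my ($MatrixRef, $rows, $variables) = @_;
    my @xpx = map { [] } 1 .. $variables;      # no need to pre-zero

    for my $r (0 .. $variables - 1) {
        my $xpx_r = $xpx[$r];                  # cache one level of indexing
        for my $c ($r .. $variables - 1) {
            my $sum = 0;
            for my $RowRef (@$MatrixRef) {
                $sum += $RowRef->[$r] * $RowRef->[$c];
            }
            $xpx_r->[$c] = $sum;
            $xpx[$c][$r] = $sum if $r != $c;   # mirror to the lower triangle
        }
    }
    return \@xpx;
}

# Example: ComputeXpX_opt([[1,2],[3,4],[5,6]], 3, 2) yields [[35,44],[44,56]]
my $res = ComputeXpX_opt([ [1, 2], [3, 4], [5, 6] ], 3, 2);
```

Benchmark it against the original on your real data; the inner loop is unchanged, so the gain will be modest.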

Remember that Perl does very few “big” optimizations, and that the opcode tree produced by compilation closely resembles your source code.


Edit: On threading

Perl threads are heavyweight beasts that literally clone the current interpreter. It is very much like forking.

Sharing data structures across thread boundaries is possible (use threads::shared; my $variable :shared = "foo") but there are various pitfalls. It is cleaner to pass data around in a Thread::Queue.

Splitting the calculation of one product over multiple threads could end up with your threads doing more communication than calculation. You could benchmark a solution that divides responsibility for certain rows between the threads. But I think recombining the solutions efficiently would be difficult here.

More likely to be useful is to have a bunch of worker threads running from the beginning. All threads listen to a queue which contains a pair of a matrix and a return queue. The worker would then dequeue a problem, and send back the solution. Multiple calculations could be run in parallel, but a single matrix multiplication will be slower. Your other code would have to be refactored significantly to take advantage of the parallelism.

Untested code:

use strict; use warnings; use threads; use Thread::Queue;

# spawn worker threads:
my $problem_queue = Thread::Queue->new;
my @threads = map threads->new(\&worker, $problem_queue), 1..3; # make 3 workers

# automatically close threads when program exits
END {
  $problem_queue->enqueue((undef) x @threads);
  $_->join for @threads;
}

# This is the wrapper around the threading,
# and can be called exactly as ComputeXpX
sub async_XpX {
  my $return_queue = Thread::Queue->new();
  $problem_queue->enqueue([$return_queue, @_]);
  return sub { $return_queue->dequeue };
}

# The main loop of worker threads
sub worker {
  my ($queue) = @_;
  while(defined(my $problem = $queue->dequeue)) {
     my ($return, @args) = @$problem;
     $return->enqueue(ComputeXpX(@args));
  }
}

sub ComputeXpX { ... } # as before

The async_XpX returns a coderef that will eventually collect the result of the computation. This allows us to carry on with other stuff until we need the result.

# start two calculations
my $future1 = async_XpX(...);
my $future2 = async_XpX(...);

...; # do something else

# collect the results
my $result1 = $future1->();
my $result2 = $future2->();

I benchmarked the bare-bones threading code without doing actual calculations, and the communication is about as expensive as the calculations. I.e. with a bit of luck, you may start to get a benefit on a machine with at least four processors/kernel threads.
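A measurement like that might be reproduced with something like the following (my own sketch, not the exact harness used above; it times round-trips through the queues with a no-op standing in for the calculation):

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;
use Time::HiRes qw(time);

my $problem_queue = Thread::Queue->new;
my @workers = map threads->create(sub {
    while (defined(my $job = $problem_queue->dequeue)) {
        my ($return_queue, @args) = @$job;
        $return_queue->enqueue(scalar @args);   # no-op "calculation"
    }
}), 1 .. 3;

my $t0 = time;
my $replies = 0;
for (1 .. 500) {                                # roughly the 530 calls in the question
    my $return_queue = Thread::Queue->new;
    $problem_queue->enqueue([$return_queue, 1, 2, 3]);
    $replies += $return_queue->dequeue;         # each reply is 3 here
}
printf "%.3fs for 500 round-trips\n", time - $t0;

$problem_queue->enqueue((undef) x @workers);
$_->join for @workers;
```

Compare that time against 500 calls of ComputeXpX itself; if they are of the same order, threading one call at a time buys you nothing.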

A note on profiling threaded code: I know of no way to do that elegantly. It may be preferable to benchmark the threaded code but profile with single-threaded test cases.

Upvotes: 2

KJ4TIP

Reputation: 166

Since each element of the result matrix can be calculated independently, it should be possible to calculate some/all of them in parallel. In other words, none of the instances of the innermost loop depend on the results of any other, so they could run simultaneously on their own threads.
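One way to sketch that idea (hypothetical code, not from the question: it deals the independent upper-triangle cells out to a fixed number of threads in a strided fashion and stitches the results back together):

```perl
use strict;
use warnings;
use threads;

# Each thread computes a subset of the (r, c) cells of X'X and returns
# [r, c, sum] triples; the main thread assembles the symmetric matrix.
sub ComputeXpX_parallel {
    my ($MatrixRef, $variables, $nthreads) = @_;

    # Enumerate the independent upper-triangle cells once.
    my @cells;
    for my $r (0 .. $variables - 1) {
        push @cells, [$r, $_] for $r .. $variables - 1;
    }

    my @threads = map {
        my $tid = $_;
        threads->create(sub {
            my @out;
            # Strided partition: thread $tid takes cells $tid, $tid+$nthreads, ...
            for (my $i = $tid; $i < @cells; $i += $nthreads) {
                my ($r, $c) = @{ $cells[$i] };
                my $sum = 0;
                $sum += $_->[$r] * $_->[$c] for @$MatrixRef;
                push @out, [$r, $c, $sum];
            }
            return @out;
        });
    } 0 .. $nthreads - 1;

    my @xpx = map { [] } 1 .. $variables;
    for my $t (@threads) {
        for my $cell ($t->join) {
            my ($r, $c, $sum) = @$cell;
            $xpx[$r][$c] = $sum;
            $xpx[$c][$r] = $sum;
        }
    }
    return \@xpx;
}

my $p = ComputeXpX_parallel([ [1, 2], [3, 4], [5, 6] ], 2, 2);
```

Note the caveat from the other answer still applies: each thread clones the interpreter (and its copy of the matrix), so for matrices this small the thread overhead will likely dominate.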

Upvotes: 2
