krish

Reputation: 33

How to optimize matrix multiplication in C++ in terms of time complexity?

Given any 2 matrices a and b (which don't have special properties), do we have a better way of computing the multiplication than this?

for (i = 0; i < r1; ++i)
    for (j = 0; j < c2; ++j)
        for (k = 0; k < c1; ++k)
            mult[i][j] += a[i][k] * b[k][j];

Upvotes: 3

Views: 1990

Answers (5)

Dov

Reputation: 8542

This is a good question that deserves a more complete answer than "use a library".

Of course, if you want to do a good job, you probably should not try to write it yourself. But if this question is about learning how to do matrix multiplication faster, here is a complete answer.

  1. As a practical matter, the code you show writes to memory too much. If the inner loop accumulates the dot product in a scalar variable and writes to mult[i][j] only once at the end, the code will be faster. Most compilers are not smart enough to do this for you.

    double dot = 0;
    for (k = 0; k < c1; ++k)
        dot += a[i][k] * b[k][j];
    mult[i][j] = dot;

This also improves multi-core performance, since multiple cores have to share memory bandwidth. If you are using an array of rows, switch your representation to a single contiguous block of memory.

  2. As mentioned in another answer, you can transpose b so that both matrix traversals run in sequential order. Memory is designed to be read sequentially, but your b[k][j] jumps around, so this is typically about 3x faster as the size gets big (on the order of 1000x1000, the cost of the initial transpose is negligible).

  3. When the matrix gets large enough, Strassen and Coppersmith-Winograd are asymptotically faster ways of multiplying: they cleverly rearrange terms to compute the same result with a lower complexity bound. In practice they change the answer, because roundoff error accumulates differently, and for large matrices the answers they produce are likely to be worse than those of the brute-force multiplication.

  4. If you have a truly parallel computer, you can copy the matrix to multiple CPUs and have them work on the answer in parallel.

  5. You can put the code onto your video card and use its far more numerous parallel processors, which also have far more memory bandwidth. That's probably the most effective way to get a real speedup on your computer (assuming you have a graphics card). See CUDA or Vulkan.

The fundamental problem is that multiple cores don't help much for matrix multiplication, because you are limited by memory bandwidth. That's why doing it on a video card works so well: the bandwidth there is far higher.

Upvotes: 1

George Kourtis

Reputation: 2592

You could use multiple threads by dividing the multiplication among them. Divide the rows of the first matrix (or the columns of the second) into a number of tasks equal to the number of cores in your processor. If they aren't evenly divisible, some cores will have to do an extra chunk. Either way, the idea is to hand the multiplication to more cores: divide e.g. the first matrix into 4 parts (I have 4 cores), run the multiplication as 4 tasks, and reassemble the result (reassembly isn't strictly necessary, as the cores can write into the same output matrix).

Upvotes: 0

PepeHands

Reputation: 1406

If you are asking whether faster algorithms exist in theory, then yes. For example, the Strassen algorithm (see https://en.wikipedia.org/wiki/Strassen_algorithm). And it's not even the fastest we know. As far as I know, the best for now is the Coppersmith–Winograd algorithm (see https://en.wikipedia.org/wiki/Coppersmith%E2%80%93Winograd_algorithm), which runs in something like O(n^{2.37}) (Strassen's time complexity is something like O(n^{2.8})).

But in practice they are much harder to implement than the one you wrote, and they also hide a pretty large constant under the O(), so the O(n^3) algorithm you wrote is actually better for small values of n and much easier to implement.

Also there is Strassen's hypothesis, which claims that for every eps > 0 there is an algorithm that multiplies two matrices with time complexity O(n^{2 + eps}). But as you might have noticed, it is just a hypothesis for now.

Upvotes: 7

Chiel

Reputation: 6194

These are problems that many bright souls in this world have solved before you. Do not torture yourself; use BLAS ?GEMM (the ? stands for the precision, e.g. DGEMM for doubles).

http://www.netlib.org/blas/#_level_3

Upvotes: 2

Victor

Reputation: 428

As a very easy improvement, you can transpose the second matrix before the multiplication, so your code gets far fewer processor cache misses. The complexity stays the same, but it may improve the constant factor a bit.

Upvotes: 3
