tparker

Reputation: 724

What's the most efficient implementation of dense-matrix-vector multiplication?

If M is a dense m x n matrix and v is an n-component vector, then the product u = Mv is an m-component vector given by u[i] = sum(M[i,j] * v[j], 1 <= j <= n). One simple implementation of this multiplication is

allocate m-component vector u of zeroes
for i = 1:m
  for j = 1:n
    u[i] += M[i,j] * v[j]
  end
end

which builds up the vector u one element at a time. Another implementation is to interchange the loops:

allocate m-component vector u of zeroes
for j = 1:n
  for i = 1:m
    u[i] += M[i,j] * v[j]
  end
end

where the entire vector is built up together.

Which of these implementations (if either) is typically used in languages like C and Fortran that are designed for efficient numerical computation? My guess is that languages like C that internally store matrices in row-major order use the former implementation, while languages like Fortran that use column-major order use the latter, so that the inner loop accesses consecutive memory locations of the matrix M. Is this correct?

The former implementation seems more efficient, because the memory location being written to only changes m times, while in the latter implementation the memory location being written to changes m*n times, even though only m unique locations are ever written to. (Of course, by the same logic, the latter implementation would be more efficient for row-vector-matrix multiplication, but this is much less common.) On the other hand, I believe that Fortran is typically faster at dense-matrix-vector multiplication than C, so perhaps I am either (a) guessing their implementations wrong, or (b) misjudging the relative efficiency of the two implementations.

Upvotes: 2

Views: 1521

Answers (1)

user555045

Reputation: 64904

Probably using an established BLAS implementation is the most common. Apart from that, there are some issues with simple implementations that may be interesting to look at. For example, in C (or C++ for that matter), pointer aliasing often prevents a lot of optimization, so that something like

/* u = M*v for a row-major m x n matrix M; u must be zero-initialized by the caller */
void matvec(double *M, size_t n, size_t m, double *v, double * u)
{
    for (size_t i = 0; i < m; i++) {
      for (size_t j = 0; j < n; j++) {
        u[i] += M[i * n + j] * v[j];
      }
    }
}

is turned into this by Clang 5 (inner loop excerpt):

.LBB0_4: # Parent Loop BB0_3 Depth=1
  vmovsd xmm1, qword ptr [rcx + 8*rax] # xmm1 = mem[0],zero
  vfmadd132sd xmm1, xmm0, qword ptr [r13 + 8*rax - 24]
  vmovsd qword ptr [r8 + 8*rbx], xmm1
  vmovsd xmm0, qword ptr [rcx + 8*rax + 8] # xmm0 = mem[0],zero
  vfmadd132sd xmm0, xmm1, qword ptr [r13 + 8*rax - 16]
  vmovsd qword ptr [r8 + 8*rbx], xmm0
  vmovsd xmm1, qword ptr [rcx + 8*rax + 16] # xmm1 = mem[0],zero
  vfmadd132sd xmm1, xmm0, qword ptr [r13 + 8*rax - 8]
  vmovsd qword ptr [r8 + 8*rbx], xmm1
  vmovsd xmm0, qword ptr [rcx + 8*rax + 24] # xmm0 = mem[0],zero
  vfmadd132sd xmm0, xmm1, qword ptr [r13 + 8*rax]
  vmovsd qword ptr [r8 + 8*rbx], xmm0
  add rax, 4
  cmp r11, rax
  jne .LBB0_4

That really hurts to look at, and it will hurt even more to execute. The compiler "had to" do this because u may alias with M and/or v, so stores into u are treated with great suspicion ("had to" is in quotes because the compiler could insert a test for aliasing and keep a fast path for the nice case). In Fortran, procedure arguments by default cannot alias, so this problem wouldn't have existed. This is a typical reason why code that is just randomly typed out without special tricks is faster in Fortran than in C. The rest of my answer won't be about that; I'm just going to make the C code a bit faster (in the end I get back to a column-major M).

In C, one way the aliasing problem can be fixed is with restrict, though about the only thing it has going for it is that it's not intrusive (using an explicit accumulator instead of summing into u[i] also does the trick, but without relying on a magic keyword):

void matvec(double *M, size_t n, size_t m, double *v, double * restrict u)
{
    for (size_t i = 0; i < m; i++) {
      for (size_t j = 0; j < n; j++) {
        u[i] += M[i * n + j] * v[j];
      }
    }
}

Now this happens:

.LBB0_8: # Parent Loop BB0_3 Depth=1
  vmovupd ymm5, ymmword ptr [rcx + 8*rbx]
  vmovupd ymm6, ymmword ptr [rcx + 8*rbx + 32]
  vmovupd ymm7, ymmword ptr [rcx + 8*rbx + 64]
  vmovupd ymm8, ymmword ptr [rcx + 8*rbx + 96]
  vfmadd132pd ymm5, ymm1, ymmword ptr [rax + 8*rbx - 224]
  vfmadd132pd ymm6, ymm2, ymmword ptr [rax + 8*rbx - 192]
  vfmadd132pd ymm7, ymm3, ymmword ptr [rax + 8*rbx - 160]
  vfmadd132pd ymm8, ymm4, ymmword ptr [rax + 8*rbx - 128]
  vmovupd ymm1, ymmword ptr [rcx + 8*rbx + 128]
  vmovupd ymm2, ymmword ptr [rcx + 8*rbx + 160]
  vmovupd ymm3, ymmword ptr [rcx + 8*rbx + 192]
  vmovupd ymm4, ymmword ptr [rcx + 8*rbx + 224]
  vfmadd132pd ymm1, ymm5, ymmword ptr [rax + 8*rbx - 96]
  vfmadd132pd ymm2, ymm6, ymmword ptr [rax + 8*rbx - 64]
  vfmadd132pd ymm3, ymm7, ymmword ptr [rax + 8*rbx - 32]
  vfmadd132pd ymm4, ymm8, ymmword ptr [rax + 8*rbx]
  add rbx, 32
  add rbp, 2
  jne .LBB0_8

It's not scalar any more, so that's good, but it's not ideal either. While there are 8 FMAs here, they're arranged in four pairs of dependent FMAs, so taken across the whole loop there are really only 4 independent dependency chains of FMAs. FMA typically has a long latency and decent throughput: on Skylake, for example, it has a latency of 4 cycles and a throughput of 2/cycle, so 4 * 2 = 8 independent chains of FMAs are needed to use all of that compute throughput. Haswell is even worse: FMA had a latency of 5 and already a throughput of 2/cycle, so it needed 10 independent chains. Another problem is that it is hard to actually feed all of those FMAs, and the structure above does not even really try: it uses 2 loads per FMA, while loads actually have the same throughput as FMAs, so their ratio should be 1:1.

Improving the load:FMA ratio can be done by unrolling the outer loop, which lets us re-use the loads from v for several rows of M (this is not even a caching consideration, but it helps for that, too). Unrolling the outer loop also works towards the goal of having more independent chains of FMAs. Compilers typically don't like to unroll anything but the inner loop, so this takes some manual work. "Tail" iterations omitted (or: assume m is a multiple of 4).

void matvec(double *M, size_t n, size_t m, double *v, double * restrict u)
{
    size_t i;
    for (i = 0; i + 3 < m; i += 4) {
      for (size_t j = 0; j < n; j++) {
        size_t it = i;
        u[it] += M[it * n + j] * v[j];
        it++;
        u[it] += M[it * n + j] * v[j];
        it++;
        u[it] += M[it * n + j] * v[j];
        it++;
        u[it] += M[it * n + j] * v[j];
      }
    }
}

Unfortunately, Clang still decides to unroll the inner loop the wrong way, with "wrong" here meaning a naive serial unroll. There's not much point to that as long as there are still only 4 independent chains:

.LBB0_8: # Parent Loop BB0_3 Depth=1
  vmovupd ymm5, ymmword ptr [rcx + 8*rdx]
  vmovupd ymm6, ymmword ptr [rcx + 8*rdx + 32]
  vfmadd231pd ymm4, ymm5, ymmword ptr [r12 + 8*rdx - 32]
  vfmadd231pd ymm3, ymm5, ymmword ptr [r13 + 8*rdx - 32]
  vfmadd231pd ymm2, ymm5, ymmword ptr [rax + 8*rdx - 32]
  vfmadd231pd ymm1, ymm5, ymmword ptr [rbx + 8*rdx - 32]
  vfmadd231pd ymm4, ymm6, ymmword ptr [r12 + 8*rdx]
  vfmadd231pd ymm3, ymm6, ymmword ptr [r13 + 8*rdx]
  vfmadd231pd ymm2, ymm6, ymmword ptr [rax + 8*rdx]
  vfmadd231pd ymm1, ymm6, ymmword ptr [rbx + 8*rdx]
  add rdx, 8
  add rdi, 2
  jne .LBB0_8

This problem goes away if we stop being lazy and finally make some explicit accumulators:

void matvec(double *M, size_t n, size_t m, double *v, double *u)
{
    size_t i;
    for (i = 0; i + 3 < m; i += 4) {
      double t0 = 0, t1 = 0, t2 = 0, t3 = 0;
      for (size_t j = 0; j < n; j++) {
        size_t it = i;
        t0 += M[it * n + j] * v[j];
        it++;
        t1 += M[it * n + j] * v[j];
        it++;
        t2 += M[it * n + j] * v[j];
        it++;
        t3 += M[it * n + j] * v[j];
      }
      u[i] += t0;
      u[i + 1] += t1;
      u[i + 2] += t2;
      u[i + 3] += t3;
    }
}

Now Clang does this:

.LBB0_6: # Parent Loop BB0_3 Depth=1
  vmovupd ymm8, ymmword ptr [r10 - 32]
  vmovupd ymm9, ymmword ptr [r10]
  vfmadd231pd ymm6, ymm8, ymmword ptr [rdi]
  vfmadd231pd ymm7, ymm9, ymmword ptr [rdi + 32]
  lea rax, [rdi + r14]
  vfmadd231pd ymm4, ymm8, ymmword ptr [rdi + 8*rsi]
  vfmadd231pd ymm5, ymm9, ymmword ptr [rdi + 8*rsi + 32]
  vfmadd231pd ymm1, ymm8, ymmword ptr [rax + 8*rsi]
  vfmadd231pd ymm3, ymm9, ymmword ptr [rax + 8*rsi + 32]
  lea rax, [rax + r14]
  vfmadd231pd ymm0, ymm8, ymmword ptr [rax + 8*rsi]
  vfmadd231pd ymm2, ymm9, ymmword ptr [rax + 8*rsi + 32]
  add rdi, 64
  add r10, 64
  add rbp, -8
  jne .LBB0_6

Which is decent. The load:FMA ratio is 10:8, and there are too few accumulators for Haswell, so some improvement is still possible. Some other interesting unrolling combinations (outer x inner) are: 4x3 (12 accumulators, 3 temporaries, 5/4 load:FMA), 5x2 (10, 2, 6/5), 7x2 (14, 2, 8/7), 15x1 (15, 1, 16/15). That makes it look as though unrolling the outer loop more is better, but having too many different streams (even if not "streaming" in the sense of "streaming loads") is bad for automatic prefetching, and when actually streaming it may be bad to exceed the number of fill buffers (actual details are scarce). Manual prefetching is also an option. Getting to an actually good MVM procedure would take a lot more work, trying out a lot of these things.
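
To give an idea of what a wider outer unroll looks like at the source level, here is a sketch with 8 rows per iteration and one scalar accumulator per row, so the compiler gets 8 independent chains to work with. This is a sketch only: it assumes m is a multiple of 8, tail handling is omitted, and how many vector accumulators (and which inner unroll factor) the compiler actually makes out of it is up to the compiler.

#include <stddef.h>

void matvec(double *M, size_t n, size_t m, double *v, double *u)
{
    for (size_t i = 0; i + 7 < m; i += 8) {
      double t0 = 0, t1 = 0, t2 = 0, t3 = 0;
      double t4 = 0, t5 = 0, t6 = 0, t7 = 0;
      for (size_t j = 0; j < n; j++) {
        double vj = v[j];               /* one load of v, reused for 8 rows of M */
        t0 += M[(i + 0) * n + j] * vj;
        t1 += M[(i + 1) * n + j] * vj;
        t2 += M[(i + 2) * n + j] * vj;
        t3 += M[(i + 3) * n + j] * vj;
        t4 += M[(i + 4) * n + j] * vj;
        t5 += M[(i + 5) * n + j] * vj;
        t6 += M[(i + 6) * n + j] * vj;
        t7 += M[(i + 7) * n + j] * vj;
      }
      /* the stores into u stay outside the inner loop, as before */
      u[i + 0] += t0;  u[i + 1] += t1;  u[i + 2] += t2;  u[i + 3] += t3;
      u[i + 4] += t4;  u[i + 5] += t5;  u[i + 6] += t6;  u[i + 7] += t7;
    }
}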

Moving the stores into u out of the inner loop meant that restrict was no longer necessary. Most impressive, I think, is that no SIMD intrinsics were needed to get this far - Clang is pretty good at this, as long as there is no scary potential aliasing. GCC and ICC do try, but don't unroll enough; yet more manual unrolling would probably do the trick.
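
As for the BLAS route mentioned at the start: for comparison, all of the above collapses into a single library call. A minimal sketch, assuming a CBLAS-style interface (OpenBLAS, MKL and friends ship one); the wrapper name is mine, and note that with beta = 0.0 the call overwrites u rather than accumulating into it.

#include <stddef.h>
#include <cblas.h>

/* u = 1.0 * M * v + 0.0 * u, with M an m x n row-major matrix */
void matvec_blas(double *M, size_t n, size_t m, double *v, double *u)
{
    cblas_dgemv(CblasRowMajor, CblasNoTrans,
                (int)m, (int)n,     /* rows and columns of M */
                1.0, M, (int)n,     /* alpha, M, leading dimension (= n for row-major) */
                v, 1,               /* v and its stride */
                0.0, u, 1);         /* beta, u and its stride */
}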

Loop tiling is also an option, but this is MVM: tiling is pretty much mandatory for MMM, but MMM has an almost unlimited amount of data reuse, which MVM does not have. Only the vector is reused; the matrix is just streamed through once. For a huge matrix, the memory throughput needed to stream it is likely a bigger problem than the vector not fitting in cache.

With a column-major M it's different, with no significant loop-carried dependency. There is a dependency through memory, but it has a lot of time to resolve. The load:FMA ratio still has to be reduced, so it still takes some unrolling of the outer loop, but overall it seems easier to deal with. It can be rearranged to use mostly additions, but FMA has a high throughput anyway (on Haswell, higher than addition!). It does not need the horizontal sums, which were annoying, but those happened outside the inner loop anyway; in return there are stores in the inner loop. Without trying it, I don't expect a large inherent difference between the two approaches; it seems like both should be tunable to between 80 and 90 percent of the compute throughput (for cacheable sizes). The "annoying extra load" inherently prevents getting arbitrarily close to 100% either way.
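
For reference, a minimal sketch of that column-major variant (same calling convention as the row-major versions above, with u assumed zero-initialized by the caller; restrict is back because the stores into u are in the inner loop again):

#include <stddef.h>

void matvec(double *M, size_t n, size_t m, double *v, double * restrict u)
{
    /* M is column-major (Fortran-style), m rows by n columns:
       element (i,j) lives at M[j * m + i]. The inner loop walks down
       column j, scales it by v[j] and adds it into u. */
    for (size_t j = 0; j < n; j++) {
      for (size_t i = 0; i < m; i++) {
        u[i] += M[j * m + i] * v[j];
      }
    }
}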

Upvotes: 4
