Jiew Meng
Jiew Meng

Reputation: 88187

Matrix Multiplication optimization via matrix transpose

I am working on an assignment where I transpose a matrix to reduce cache misses for a matrix multiplication operation. From what I understand from a few classmates, I should get 8x improvement. However, I am only getting 2x ... what might I be doing wrong?

Full Source on GitHub

void transpose(int size, matrix m) {
    int i, j;
    for (i = 0; i < size; i++) 
        for (j = 0; j < size; j++) 
            std::swap(m.element[i][j], m.element[j][i]);
}

void mm(matrix a, matrix b, matrix result) {
    int i, j, k;
    int size = a.size;
    long long before, after;

    before = wall_clock_time();
    // Do the multiplication
    transpose(size, b); // transpose the matrix to reduce cache miss
    for (i = 0; i < size; i++)
        for (j = 0; j < size; j++) {
            int tmp = 0; // save memory writes
            for(k = 0; k < size; k++)
                tmp += a.element[i][k] * b.element[j][k];
            result.element[i][j] = tmp;
        }
    after = wall_clock_time();
    fprintf(stderr, "Matrix multiplication took %1.2f seconds\n", ((float)(after - before))/1000000000);
}

Am I doing things right so far?

FYI: The next optimization I need to do is use SIMD/Intel SSE3

Upvotes: 3

Views: 9840

Answers (2)

Bert te Velde
Bert te Velde

Reputation: 853

If your assignment implies that you MUST transpose, then, of course, you should correct your transpose procedure. As it stands, it does the transpose TWO times, resulting in no transpose at all. The j=loop should not read

j=0; j<size; j++

but

j=0; j<i; j++

Transposing is not necessary to avoid processing the elements of one of the factor-matrices in the "wrong" order. Just interchange the j-loop and the k-loop. Leaving aside for the moment any (other) performance-tuning, the basic loop-structure should be:

  for (int i=0; i<size; i++)
  {
    for (int k=0; k<size; k++)
    {
      double tmp = a[i][k];
      for (int j=0; j<size; j++)
      {
        result[i][j] += tmp * b[k][j];
      }
    }
  }

Upvotes: 4

David Hammen
David Hammen

Reputation: 33106

Am I doing things right so far?

No. You have a problem with your transpose. You should have seen this problem before you started worrying about performance. When you are doing any kind of hacking around for optimizations it always a good idea to use the naive but suboptimal implementation as a test. An optimization that achieves a factor of 100 speedup is worthless if it doesn't yield the right answer.

Another optimization that will help is to pass by reference. You are passing copies. In fact, your matrix result may never get out because you are passing copies. Once again, you should have tested.

Yet another optimization that will help the speedup is to cache some pointers. This is still quite slow:

for(k = 0; k < size; k++)
    tmp += a.element[i][k] * b.element[j][k];
result.element[i][j] = tmp;

An optimizer might see a way around the pointer problems, but probably not. At least not if you don't use the nonstandard __restrict__ keyword to tell the compiler that your matrices don't overlap. Cache pointers so you don't have to do a.element[i], b.element[j], and result.element[i]. And it still might help to tell the compiler that these arrays don't overlap with the __restrict__ keyword.

Addendum
After looking over the code, it needs help. A minor comment first. You aren't writing C++. Your code is C with a tiny hint of C++. You're using struct rather than class, malloc rather than new, typedef struct rather than just struct, C headers rather than C++ headers.

Because of your implementation of your struct matrix, my comment on slowness due to copy constructors was incorrect. That it was incorrect is even worse! Using the implicitly-defined copy constructor in conjunction with classes or structs that contain naked pointers is playing with fire. You will get burned very badly if someone calls m(a, a, a_squared) to get the square of matrix a. You will get burned even worse if some expects m(a, a, a) to do an in-place computation of a2.

Mathematically, your code only covers a tiny portion of the matrix multiplication problem. What if someone wants to multiply a 100x1000 matrix by a 1000x200 matrix? That's perfectly valid, but your code doesn't handle it because your code only works with square matrices. On the other hand, your code will let someone multiply a 100x100 matrix by a 200x200 matrix, which doesn't make a bit of sense.

Structurally, your code has close to a 100% guarantee that it will be slow because of your use of ragged arrays. malloc can spritz the rows of your matrices all across memory. You'll get much better performance if the matrix is internally represented as a contiguous array but is accessed as if it were a NxM matrix. C++ provides some nice mechanisms for doing just that.

Upvotes: 11

Related Questions