Matrix Multiplication using vectorized c++

Question

I'm trying to write a c++ code to do a matrix multiplication using SIMD but the result is wrong here is my code

    void mat_sse(DATA m1[][SIZE], DATA m2[][SIZE], DATA mout[][SIZE])
{

    DATA prod = 0;

    __m128 X, Y, Z, M, N;

    for(int i=0; i



where Size is 
const int SIZE = 40; 
 could you please  help ?

user555045 · Accepted Answer

There is a lot wrong with that.

for(int k=0; k< SIZE; k=k+4){
    for( int j=0; j



Both loops advance by 4, so the body of the inner loop handles 16 steps of the old scalar loop at once. Except it doesn't, it does "four things".

And they're not the right things:

X = _mm_load_ps(&m1[i][k]);
Y = _mm_load_ps(&m2[k][j]);
M = _mm_mul_ps(X, Y);


So every iteration of the inner loop takes the same tiny row vector out of m1, and the next tiny row vector out of m2, and then multiplies them pointwise. That doesn't work. For example, if we had two 4x4 matrixes: (shown partially)

A B C D   X Y Z W
E . . .   S . . .
I . . . × T . . .
M . . .   U . . .


An iteration of the inner loop would calculate AX, BY, CZ and DW. AX indeed should be in the result, but a real matrix multiplication doesn't involve BY: the rows of m1 are combined with the columns of m2, so BY and so on where the second entry in a row of m1 is multiplied by the first entry in a column of m2, cannot happen. There are many different ways to arrange that calculation, but the way implemented here is not a rearrangement, it computes some wrong products and it skips many necessary products.

It is convenient to load a little row from m2, and broadcast a single entry from m1. That way, the product is a little row in mout, so it can be accumulated and written to the result without further shuffling.

By the way you sort of did that last part already,

mout[i][j] += Z[0];
mout[i][j+1] += Z[1];
mout[i][j+2] += Z[2];
mout[i][j+3] += Z[3];


.. but having it in the loop is bad, and it only makes sense to do when the result of the product are numbers that should be summed into those locations. This load/sum/store thing was in the inner loop because the inner loop was the j loop, but that can be fixed by exchanging the j and k loops: (not tested)

for (int i = 0; i < SIZE; i++) {
    for (int j = 0; j < SIZE; j += 4) {
        __m128 sum = _mm_setzero_ps();
        for (int k = 0; k < SIZE; k++) {
            __m128 entry = _mm_set1_ps(m1[i][k]);
            __m128 row  = _mm_load_ps(&m2[k][j]);
            sum = _mm_add_ps(sum, _mm_mul_ps(entry, row));
        }
        _mm_store_ps(&mout[i][j], sum);
    }
}


That code is still slow, for various reasons:


the loop-carried dependency through addps is slower than the available throughput. Use more independent accumulators.
too many loads per arithmetic operation.
for mid-to-large matrixes, use cache blocking. Not when size = 40 though.

Matrix Multiplication using vectorized c++

Answers (2)

Related Questions