user3466306

Reputation: 11

SSE error in one code

I am new to SSE. I am having trouble transforming this code:

for (i = 0; i < m_; i++) {
    for (j = 0; j < n_; j++) {
        (*vec)->data[i] += coefficientsI[j] * coefficientsII[j][i];
    }
}

into an SSE routine. I have this:

    __m128i vecMul, vecC, vecB, vsum;

    int B_aux[4];

    int multiple = n_/4;

    for (i = 0; i < m_; i++) {

            vsum = _mm_setzero_si128();

            for (j = 0; j < multiple ; j+=4) {

                    B_aux[0] = coefficientsII[j][i];
                    B_aux[1] = coefficientsII[j+1][i];
                    B_aux[2] = coefficientsII[j+2][i];
                    B_aux[3] = coefficientsII[j+3][i];

                    vecC = _mm_loadu_si128((__m128i *)&((coefficientsI)[j] ));
                    vecB = _mm_loadu_si128((__m128i *)&(B_aux) );

                    vecMul = _mm_mullo_epi32(vecC,vecB);

                    vsum = _mm_add_epi32(vsum,vecMul);
            }

            vsum = _mm_hadd_epi32(vsum, vsum);
            vsum = _mm_hadd_epi32(vsum, vsum);

            (*vec)->data[i] += _mm_extract_epi32(vsum, 0);

            for ( ; j < n_ ; j++)
                    (*vec)->data[i] += coefficientsI[j] * coefficientsII[j][i];
    }

But this doesn't work. Where is the problem?

I want to vectorize this kernel because it was detected as a hotspot. However, it fails: the result is wrong.

Thanks

Upvotes: 1

Views: 78

Answers (1)

Z boson

Reputation: 33699

Your definition of multiple is wrong. You divide by four but still increment by four, so when you clean up in the final loop it is off by a factor of four. This is easy to fix. Define multiple as

int multiple = n_ & (-4);

If you keep your definition of multiple, you have to do

for (j = 0; j < multiple ; j++) {
    B_aux[0] = coefficientsII[4*j][i];
    B_aux[1] = coefficientsII[4*j+1][i];
    B_aux[2] = coefficientsII[4*j+2][i];
    B_aux[3] = coefficientsII[4*j+3][i];
    vecC = _mm_loadu_si128((__m128i *)&((coefficientsI)[4*j] ));
    //..

for (j*=4 ; j < n_ ; j++)
    (*vec)->data[i] += coefficientsI[j] * coefficientsII[j][i];
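
For reference, here is a minimal sketch of the whole kernel using the first variant (multiple = n_ & (-4)). It is not your exact code: the function name matvec_sse, the plain out array standing in for (*vec)->data, and the flat row-major layout coefficientsII[j * m_ + i] are assumptions made so the example is self-contained. It also assumes the coefficients are 32-bit int, as the integer intrinsics in the question imply. The target attribute is GCC/Clang-specific; compiling with -msse4.1 instead should work as well.

```c
#include <smmintrin.h> /* SSE4.1: _mm_mullo_epi32, _mm_extract_epi32 (also pulls in SSSE3 _mm_hadd_epi32) */

/* Hypothetical helper: out[i] += sum over j of coefficientsI[j] * coefficientsII[j][i].
 * coefficientsII is assumed to be a flat row-major n_ x m_ array of int. */
__attribute__((target("sse4.1")))
void matvec_sse(int *out, const int *coefficientsI,
                const int *coefficientsII, int m_, int n_)
{
    int multiple = n_ & (-4); /* n_ rounded down to a multiple of 4 */

    for (int i = 0; i < m_; i++) {
        __m128i vsum = _mm_setzero_si128();
        int j;

        /* Vector part: four values of j per iteration. */
        for (j = 0; j < multiple; j += 4) {
            /* Column i is strided in memory, so gather it by hand. */
            int B_aux[4];
            B_aux[0] = coefficientsII[(j + 0) * m_ + i];
            B_aux[1] = coefficientsII[(j + 1) * m_ + i];
            B_aux[2] = coefficientsII[(j + 2) * m_ + i];
            B_aux[3] = coefficientsII[(j + 3) * m_ + i];

            __m128i vecC = _mm_loadu_si128((const __m128i *)&coefficientsI[j]);
            __m128i vecB = _mm_loadu_si128((const __m128i *)B_aux);

            vsum = _mm_add_epi32(vsum, _mm_mullo_epi32(vecC, vecB));
        }

        /* Horizontal sum of the four 32-bit lanes. */
        vsum = _mm_hadd_epi32(vsum, vsum);
        vsum = _mm_hadd_epi32(vsum, vsum);
        out[i] += _mm_extract_epi32(vsum, 0);

        /* Scalar cleanup for the remaining n_ % 4 elements. */
        for (; j < n_; j++)
            out[i] += coefficientsI[j] * coefficientsII[j * m_ + i];
    }
}
```

Calling matvec_sse(out, coefficientsI, coefficientsII, m_, n_) should accumulate the same sums into out as the scalar double loop from the question, which makes it easy to check one against the other on a small input.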

See my example here: "Manually vectorized code 10x slower than auto optimized - what I did wrong?"

Note that your code is not cache friendly when you access coefficientsII[j][i]. If you can generate the transpose coefficientsII_T and access it as coefficientsII_T[i][j], you will probably see better results.
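
A rough sketch of what the transposed version could look like, under some assumptions: coefficientsII_T is taken to be a flat row-major m_ x n_ array of 32-bit int (so element [i][j] lives at i * n_ + j), out stands in for (*vec)->data, and the function name matvec_sse_transposed is made up for illustration. With this layout each row is contiguous, so the B_aux gather is no longer needed and both operands can be loaded directly. The target attribute is GCC/Clang-specific; compiling with -msse4.1 is an alternative.

```c
#include <smmintrin.h> /* SSE4.1: _mm_mullo_epi32, _mm_extract_epi32 (also pulls in SSSE3 _mm_hadd_epi32) */

/* Hypothetical helper: out[i] += sum over j of coefficientsI[j] * coefficientsII_T[i][j].
 * coefficientsII_T is assumed to be a flat row-major m_ x n_ array of int. */
__attribute__((target("sse4.1")))
void matvec_sse_transposed(int *out, const int *coefficientsI,
                           const int *coefficientsII_T, int m_, int n_)
{
    int multiple = n_ & (-4); /* n_ rounded down to a multiple of 4 */

    for (int i = 0; i < m_; i++) {
        const int *row = &coefficientsII_T[i * n_]; /* contiguous in j */
        __m128i vsum = _mm_setzero_si128();
        int j;

        for (j = 0; j < multiple; j += 4) {
            /* Both operands are contiguous, so plain unaligned loads suffice. */
            __m128i vecC = _mm_loadu_si128((const __m128i *)&coefficientsI[j]);
            __m128i vecB = _mm_loadu_si128((const __m128i *)&row[j]);
            vsum = _mm_add_epi32(vsum, _mm_mullo_epi32(vecC, vecB));
        }

        /* Horizontal sum of the four 32-bit lanes. */
        vsum = _mm_hadd_epi32(vsum, vsum);
        vsum = _mm_hadd_epi32(vsum, vsum);
        out[i] += _mm_extract_epi32(vsum, 0);

        /* Scalar cleanup for the remaining n_ % 4 elements. */
        for (; j < n_; j++)
            out[i] += coefficientsI[j] * row[j];
    }
}
```

Whether this is actually faster depends on how often the transpose has to be rebuilt; if coefficientsII changes on every call, the cost of transposing may eat the gain.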

Upvotes: 1
