Reputation: 11
I am new to SSE. I am trying to convert this code:
for (i = 0; i < m_; i++) {
for (j = 0; j < n_; j++) {
(*vec)->data[i] += coefficientsI[j] * coefficientsII[j][i];
}
}
into an SSE routine. This is what I have so far:
__m128i vecMul, vecC, vecB, vsum;
int B_aux[4];
int multiple = n_/4;
for (i = 0; i < m_; i++) {
vsum = _mm_setzero_si128();
for (j = 0; j < multiple ; j+=4) {
B_aux[0] = coefficientsII[j][i];
B_aux[1] = coefficientsII[j+1][i];
B_aux[2] = coefficientsII[j+2][i];
B_aux[3] = coefficientsII[j+3][i];
vecC = _mm_loadu_si128((__m128i *)&((coefficientsI)[j] ));
vecB = _mm_loadu_si128((__m128i *)&(B_aux) );
vecMul = _mm_mullo_epi32(vecC,vecB);
vsum = _mm_add_epi32(vsum,vecMul);
}
vsum = _mm_hadd_epi32(vsum, vsum);
vsum = _mm_hadd_epi32(vsum, vsum);
(*vec)->data[i] += _mm_extract_epi32(vsum, 0);
for ( ; j < n_ ; j++)
(*vec)->data[i] += coefficientsI[j] * coefficientsII[j][i];
}
But this doesn't work: the result is wrong. Where is the problem?
I want to vectorize this kernel because it was detected as a hotspot.
Thanks
Upvotes: 1
Views: 78
Reputation: 33699
Your cutoff value, multiple, is wrong. You divide by four but still increment by four, so when you clean up in the final loop it's off by a factor of four. This is easy to fix. Define multiple as
int multiple = n_ & (-4);
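As a rough illustration of that loop split, here is a minimal scalar sketch. The function name dot_blocked and the lane array are made up for this example (they are not from the question); the four lane accumulators only stand in for the four 32-bit lanes of vsum. It assumes a non-negative count, so that the bitwise AND with -4 clears the two low bits and rounds down to a multiple of four.

```c
#include <assert.h>

/* Hypothetical helper: dot product of a[0..n-1] and b[0..n-1],
 * processed in blocks of four plus a scalar cleanup loop. */
int dot_blocked(const int *a, const int *b, int n)
{
    assert(n >= 0);               /* assumption: the count is non-negative */

    int multiple = n & (-4);      /* largest multiple of 4 that is <= n */
    int lane[4] = {0, 0, 0, 0};   /* stands in for the four lanes of vsum */
    int j;

    /* Main loop: j advances by 4 and stops at the rounded-down bound. */
    for (j = 0; j < multiple; j += 4) {
        lane[0] += a[j]     * b[j];
        lane[1] += a[j + 1] * b[j + 1];
        lane[2] += a[j + 2] * b[j + 2];
        lane[3] += a[j + 3] * b[j + 3];
    }

    /* Horizontal sum of the lanes (the job of the two hadd calls). */
    int sum = lane[0] + lane[1] + lane[2] + lane[3];

    /* Cleanup loop: picks up at j == multiple and handles n % 4 elements. */
    for (; j < n; j++)
        sum += a[j] * b[j];

    return sum;
}
```

With n equal to 10, multiple is 8, so the main loop runs twice and the cleanup loop handles the last two elements.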
If you keep your definition of multiple, you have to index with 4*j instead:
for (j = 0; j < multiple ; j++) {
B_aux[0] = coefficientsII[4*j][i];
B_aux[1] = coefficientsII[4*j+1][i];
B_aux[2] = coefficientsII[4*j+2][i];
B_aux[3] = coefficientsII[4*j+3][i];
vecC = _mm_loadu_si128((__m128i *)&((coefficientsI)[4*j] ));
//..
for (j*=4 ; j < n_ ; j++)
(*vec)->data[i] += coefficientsI[j] * coefficientsII[j][i];
See my example here: "Manually vectorized code 10x slower than auto optimized - what I did wrong?"
Note that your code is not cache friendly when you access coefficientsII[j][i], because consecutive values of j land in different rows. If you can generate the transpose coefficientsII_T and access it as coefficientsII_T[i][j], you will probably see better results.
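To make the transpose idea concrete, here is a hedged sketch. It assumes the transposed coefficients are stored as one flat, row-major array of m rows by n columns; the names accumulate_transposed, out, c and bt are hypothetical and do not come from the question. Because each row is now contiguous in j, it can be loaded directly with _mm_loadu_si128 and the B_aux gather buffer is no longer needed. The SSE path is compiled only when SSE4.1 is enabled (for example with -msse4.1); otherwise a scalar loop computes the same result.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE4_1__)
#include <smmintrin.h>   /* SSE4.1: _mm_mullo_epi32 */
#endif

/* Hypothetical helper: out[i] += sum over j of c[j] * bt[i*n + j],
 * where bt is the transposed matrix (m rows, n columns, row-major). */
void accumulate_transposed(int *out, const int *c, const int *bt,
                           int m, int n)
{
    assert(m >= 0 && n >= 0);
    int multiple = n & (-4);                 /* round n down to a multiple of 4 */

    for (int i = 0; i < m; i++) {
        const int *row = bt + (size_t)i * n; /* row i is contiguous in j */
        int sum = 0;
        int j = 0;
#if defined(__SSE4_1__)
        __m128i vsum = _mm_setzero_si128();
        for (; j < multiple; j += 4) {
            __m128i vc = _mm_loadu_si128((const __m128i *)(c + j));
            __m128i vb = _mm_loadu_si128((const __m128i *)(row + j));
            vsum = _mm_add_epi32(vsum, _mm_mullo_epi32(vc, vb));
        }
        vsum = _mm_hadd_epi32(vsum, vsum);   /* horizontal add, step 1 */
        vsum = _mm_hadd_epi32(vsum, vsum);   /* horizontal add, step 2 */
        sum = _mm_cvtsi128_si32(vsum);       /* read lane 0 */
#else
        for (; j < multiple; j++)            /* scalar fallback, same bound */
            sum += c[j] * row[j];
#endif
        for (; j < n; j++)                   /* cleanup for the last n % 4 */
            sum += c[j] * row[j];
        out[i] += sum;
    }
}
```

Whether this pays off depends on how often the transpose has to be rebuilt. If coefficientsII changes rarely, the cost of transposing once is spread over many calls.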
Upvotes: 1