Reputation: 1060
I have the following kernel vectorized for arrays with integers:
long valor = 0, i=0;
__m128i vsum, vecPi, vecCi, vecQCi;
vsum = _mm_set1_epi32(0);
int32_t * const pA = A->data;
int32_t * const pB = B->data;
int sumDot[1];
for( ; i<SIZE-3 ;i+=4){
vecPi = _mm_loadu_si128((__m128i *)&(pA)[i] );
vecCi = _mm_loadu_si128((__m128i *)&(pB)[i] );
vecQCi = _mm_mullo_epi32(vecPi,vecCi);
vsum = _mm_add_epi32(vsum,vecQCi);
}
vsum = _mm_hadd_epi32(vsum, vsum);
vsum = _mm_hadd_epi32(vsum, vsum);
_mm_storeu_si128((__m128i *)&(sumDot), vsum);
for( ; i<SIZE; i++)
valor += A->data[i] * B->data[i];
valor += sumDot[0];
and it works fine. However, if I change the datatype of A and B to short instead of int, shouldn't I use the following code:
long valor = 0, i=0;
__m128i vsum, vecPi, vecCi, vecQCi;
vsum = _mm_set1_epi16(0);
int16_t * const pA = A->data;
int16_t * const pB = B->data;
int sumDot[1];
for( ; i<SIZE-7 ;i+=8){
vecPi = _mm_loadu_si128((__m128i *)&(pA)[i] );
vecCi = _mm_loadu_si128((__m128i *)&(pB)[i] );
vecQCi = _mm_mullo_epi16(vecPi,vecCi);
vsum = _mm_add_epi16(vsum,vecQCi);
}
vsum = _mm_hadd_epi16(vsum, vsum);
vsum = _mm_hadd_epi16(vsum, vsum);
_mm_storeu_si128((__m128i *)&(sumDot), vsum);
for( ; i<SIZE; i++)
valor += A->data[i] * B->data[i];
valor += sumDot[0];
This second kernel doesn't work and I don't know why. I know that all the entries of the vectors in the first and second case are the same (no overflow as well). Can someone help me finding the mistake?
Thanks
Upvotes: 2
Views: 268
Reputation: 13457
Here's a few things I see.
In both the int
and short
case, when you're storing the __m128
to sumDot
, you use _mm_storeu_si128
on targets that are much smaller than 128 bits. This means you've been corrupting memory, and were lucky you were not bitten.
sumDot
is an int[1]
even in the short
case, you were storing two short
s in one int
, and then reading it as an int
.In the short
case you're missing one horizontal vector reduction step. Remember that now that you've got 8 short
s per vector, you must now have log_2(8) = 3 vector reduction steps.
vsum = _mm_hadd_epi16(vsum, vsum);
vsum = _mm_hadd_epi16(vsum, vsum);
vsum = _mm_hadd_epi16(vsum, vsum);
(Optional) Since you're onto SSE4.1 already, might as well use one of the goodies it has: The PEXTR*
instructions. They take the index of the lane from which to extract. You're interested in the bottom lane (lane 0) because that's where the sum ends up after your vector reduction.
/* 32-bit */
sumDot[0] = _mm_extract_epi32(vsum, 0);
/* 16-bit */
sumDot[0] = _mm_extract_epi16(vsum, 0);
EDIT: Apparently the compiler doesn't sign-extend the 16-bit word extracted with _mm_extract_epi16
. You must convince it to do so yourself.
/* 32-bit */
sumDot[0] = (int32_t)_mm_extract_epi32(vsum, 0);
/* 16-bit */
sumDot[0] = (int16_t)_mm_extract_epi16(vsum, 0);
EDIT2: I found an even BETTER solution! It uses exactly the instruction we need (PMADDWD
), and is identical to the 32-bit code except that the iteration bounds are different, and instead of _mm_mullo_epi16
you use _mm_madd_epi16
in the loop. This only needs two 32-bit vector reduction stages. http://pastebin.com/A9ibkMwP
_mm_setzero_*()
functions instead of _mm_set1_*(0)
.Upvotes: 2