florian
florian

Reputation: 115

SIMD/SSE : short dot product and short max value

I'm trying to optimize a dot product of two c-style arrays of contant and small size and of type short.

I've read several documentations about SIMD intrinsics and many blog posts/articles about dot product optimization using this intrisincs.

However, i don't understand how a dot product on short arrays using this intrinsics can give the right result. When making the dot product, the computed values can be (and are always, in my case) greater than SHORT_MAX, so is there sum. Hence, i store them in a variable of double type.

As i understand the dot product using simd intrinsic, we use __m128i variables types and operations are returning __m128i. So, what i don't understand is why it doesn't "overflow" and how the result can be transformed into a value type that can handle it?

thanks for your advices

Upvotes: 2

Views: 744

Answers (2)

florian
florian

Reputation: 115

Just for the records, here is how i make an dot product for 2 int16 arrays of size 36:

double dotprod(const int16_t* source, const int16_t* target, const int size){
#ifdef USE_SSE
    int res[4];
    __m128i* src = (__m128i *) source;
    __m128i* t = (__m128i *) target;
    __m128i s = _mm_madd_epi16(_mm_loadu_si128(src), mm_loadu_si128(t));
    ++src;
    ++t;
    s = _mm_add_epi32(s, _mm_madd_epi16(_mm_loadu_si128(src), _mm_loadu_si128(t)));
    ++src;
    ++t;
    s = _mm_add_epi32(s, _mm_madd_epi16(_mm_loadu_si128(src), _mm_loadu_si128(t)));
    ++src;
    ++t;
    s = _mm_add_epi32(s, _mm_madd_epi16(_mm_loadu_si128(src), _mm_loadu_si128(t)));

    /* return the sum of the four 32-bit sub sums */
    _mm_storeu_si128((__m128i*)&res, s);
    return res[0] + res[1] + res[2] + res[3] + source[32] * target[32] + source[33] * target[33] + source[34] * target[34] + source[35] * target[35];
#elif USE_AVX
    int res[8];
    __m256i* src = (__m256i *) source;
    __m256i* t = (__m256i *) target;
    __m256i s = _mm256_madd_epi16(_mm256_loadu_si256(src), _mm256_loadu_si256(t));
    ++src;
    ++t;
    s = _mm256_add_epi32(s, _mm256_madd_epi16(_mm256_loadu_si256(src), _mm256_loadu_si256(t)));

    /* return the sum of the 8 32-bit sub sums */
    _mm256_storeu_si256((__m256i*)&res, s);
    return res[0] + res[1] + res[2] + res[3] + res[4] + res[5] + res[6] + res[7] + source[32] * target[32] + source[33] * target[33] + source[34] * target[34] + source[35] * target[35];
#else
    return source[0] * target[0] + source[1] * target[1] + source[2] * target[2] + source[3] * target[3] + source[4] * target[4]+ source[5] * target[5] + source[6] * target[6] + source[7] * target[7] + source[8] * target[8] + source[9] * target[9] + source[10] * target[10] + source[11] * target[11] + source[12] * target[12] + source[13] * target[13] + source[14] * target[14] + source[15] * target[15] + source[16] * target[16] + source[17] * target[17] + source[18] * target[18] + source[19] * target[19] + source[20] * target[20] + source[21] * target[21] + source[22] * target[22] + source[23] * target[23] + source[24] * target[24] + source[25] * target[25] + source[26] * target[26] + source[27] * target[27] + source[28] * target[28] + source[29] * target[29] + source[30] * target[30] + source[31] * target[31] + source[32] * target[32] + source[33] * target[33] + source[34] * target[34] + source[35] * target[35];
#endif
}

Upvotes: 0

Paul R
Paul R

Reputation: 213160

Depending on the range of your data values you might use an intrinsic such as _mm_madd_epi16, which performs multiply/add on 16 bit data and generates 32 bit terms. You would then need to periodically accumulate your 32 bit terms to 64 bits. How often you need to do this depends on the range of your input data, e.g. if it's 12 bit greyscale image data then you can do 64 iterations at 8 elements per iteration (i.e. 512 input points) before there is the potential for overflow. In the worst case however, if your input data uses the full 16 bit range, then you would need to do the additional 64 bit accumulate on every iteration (i.e. every 8 points).

Upvotes: 2

Related Questions