SIMD/SSE : short dot product and short max value

Question

I'm trying to optimize a dot product of two C-style arrays of small, constant size and of type short.

I've read a fair amount of documentation on SIMD intrinsics, and many blog posts and articles about optimizing a dot product with these intrinsics.

However, I don't understand how a dot product on short arrays using these intrinsics can give the right result. When computing the dot product, the individual products can be (and in my case always are) greater than SHORT_MAX, and so is their sum. Hence, I store them in a variable of type double.

As I understand the dot product using SIMD intrinsics, we use variables of type __m128i and the operations return __m128i. So what I don't understand is why it doesn't "overflow", and how the result can be converted into a type that can hold it.

Thanks for your advice.

Paul R · Accepted Answer

Depending on the range of your data values you might use an intrinsic such as _mm_madd_epi16, which multiplies 16-bit elements, adds adjacent pairs of products, and generates 32-bit terms. You would then need to periodically accumulate your 32-bit terms to 64 bits. How often you need to do this depends on the range of your input data. For example, if it's 12-bit greyscale image data, each product is below 2^24 and each pair sum below 2^25, so you can do 64 iterations at 8 elements per iteration (i.e. 512 input points) before a signed 32-bit accumulator can overflow. In the worst case, however, if your input data uses the full 16-bit range, you would need to do the additional 64-bit accumulate on every iteration (i.e. every 8 points).
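To make that concrete, here is a minimal sketch of the scheme, assuming SSE2 and unaligned loads. The names dot_s16, widen_add_epi32 and block_iters are mine, chosen for illustration, and are not from any library. block_iters is the number of 8-element iterations accumulated in 32 bits before widening to 64 bits, and must be at least 1. One caveat I'm assuming you can live with: _mm_madd_epi16 itself wraps in the single case where both elements of a pair are -32768 in both inputs, so that one input pattern is not handled here.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: sign-extend the four 32-bit lanes of v to 64 bits
   and add them into the two 64-bit lanes of acc64. */
static __m128i widen_add_epi32(__m128i acc64, __m128i v)
{
    __m128i sign = _mm_srai_epi32(v, 31);       /* 0 or -1 in each lane */
    __m128i lo = _mm_unpacklo_epi32(v, sign);   /* lanes 0 and 1 as 64-bit */
    __m128i hi = _mm_unpackhi_epi32(v, sign);   /* lanes 2 and 3 as 64-bit */
    return _mm_add_epi64(acc64, _mm_add_epi64(lo, hi));
}

/* Dot product of two short arrays with a 64-bit result.
   block_iters: 8-element iterations between 64-bit accumulations. */
int64_t dot_s16(const int16_t *a, const int16_t *b, size_t n, size_t block_iters)
{
    assert(block_iters > 0);

    __m128i acc64 = _mm_setzero_si128();
    size_t i = 0;

    while (n - i >= 8) {
        __m128i acc32 = _mm_setzero_si128();
        for (size_t k = 0; k < block_iters && n - i >= 8; ++k, i += 8) {
            __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
            __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
            /* 8 products, pairwise summed into four 32-bit terms */
            acc32 = _mm_add_epi32(acc32, _mm_madd_epi16(va, vb));
        }
        acc64 = widen_add_epi32(acc64, acc32);  /* periodic widening */
    }

    int64_t lanes[2];
    _mm_storeu_si128((__m128i *)lanes, acc64);
    int64_t sum = lanes[0] + lanes[1];

    for (; i < n; ++i)                          /* scalar tail, fewer than 8 left */
        sum += (int64_t)a[i] * b[i];
    return sum;
}
```

For 12-bit data you could call dot_s16(a, b, n, 64); for data using the full 16-bit range, pass 1 so that every iteration is widened. The result is an exact 64-bit integer, so there is no need to go through double.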
