Intel Intrinsics code optimization

Question

I'm trying to multiply each element of short int a[101] by a constant using Intel intrinsics. I have done this with addition, but I can't figure out why it won't work with multiplication. Also, we previously used 32-bit ints and now use 16-bit shorts, so as far as I understand, twice as many values should fit in the 128-bit register?

Naive example of what I'm trying to do:

int main(int argc, char **argv){
    short int a[101];
    int len = sizeof(a)/sizeof(short);

    /*Populating array a with values 1 to 101*/

    mult(len, a);

    return 0;
}

int mult(int len, short int *a){
    int result = 0;
    for(int i=0; i



And my code trying to do the same with intrinsics:

/*Same main as before with a short int a[101] containing values 1 to 101*/

int SIMD(int len, short int *a){
    int res;
    int val[4];

    /*Setting constant value to multiply with*/
    __m128i sum = _mm_set1_epi16(20);
    __m128i s = _mm_setzero_si128( );

    for(int i=0; i


I do get a number out as the result, but it does not match the naive method. I have tried other intrinsics and changed the numbers to see whether it makes any noticeable difference, but nothing comes close to the output I expect. The computation time is also almost the same as the naive version at the moment.

Eric Postpischil · Accepted Answer

There are eight short elements in one __m128i, not four. So:

for(int i=0; i<len/4*4; i+=4)



should be

for(int i=0; i<len/8*8; i+=8)


and:

res += val[0] + val[1] + val[2] + val[3];


should be:

res += val[0] + val[1] + val[2] + val[3] + val[4] + val[5] + val[6] + val[7];
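
Reading eight elements also means the temporary array has to hold eight 16-bit values rather than four ints. A minimal sketch of that horizontal sum follows, assuming val is filled by storing the accumulator s with _mm_storeu_si128 (the question's loop is cut off, so that part is an assumption); hsum_epi16 is a hypothetical helper name, not from the original post.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Add up the eight 16-bit lanes of s into one int.
   hsum_epi16 is a hypothetical helper name used for illustration. */
static int hsum_epi16(__m128i s)
{
    short val[8];                          /* eight 16-bit lanes, not int val[4] */
    _mm_storeu_si128((__m128i *)val, s);   /* copy all 128 bits out of the register */

    int res = 0;
    for (int k = 0; k < 8; k++)
        res += val[k];                     /* each lane is widened to int before adding */
    return res;
}
```

Because each lane is converted to int before the addition, the final sum itself cannot wrap at 16 bits, although the individual lanes still can while they are being accumulated.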


and:

for(int i=len/4*4; i<len; i++)


should be:

for(int i=len/8*8; i<len; i++)


In:

s += _mm_mul_epu32(vec,sum);


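_mm_mul_epu32 operates on 32-bit elements: it multiplies the low unsigned 32 bits of each 64-bit lane and produces two 64-bit products. For 16-bit elements it should be: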

s += _mm_mullo_epi16(vec, sum);
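
To illustrate what _mm_mullo_epi16 does on its own, here is a small sketch that multiplies eight shorts by a constant, lane by lane. The helper name scale8 and its parameters are hypothetical; only the intrinsics come from SSE2.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Multiply eight shorts by the same factor, one 16-bit lane each.
   scale8 is a hypothetical helper name used for illustration. */
static void scale8(const short *in, short factor, short *out)
{
    __m128i vec  = _mm_loadu_si128((const __m128i *)in);  /* load 8 x 16-bit values */
    __m128i mul  = _mm_set1_epi16(factor);                /* factor copied into every lane */
    __m128i prod = _mm_mullo_epi16(vec, mul);             /* keep the low 16 bits of each product */
    _mm_storeu_si128((__m128i *)out, prod);               /* write the 8 products back */
}
```

Only the low 16 bits of each product are kept, so a product outside the range of short wraps around.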


The object res is not initialized; it should be:

int res = 0;


Here is working code:

#include 
#include 

#include 

//  Number of elements in an array.
#define NumberOf(x) (sizeof (x) / sizeof *(x))


//  Compute the result with scalar arithmetic.
static int mult(int len, short int *a)
{
    int result = 0;
    for (size_t i=0; i
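
The listing above breaks off inside the scalar loop. As a minimal sketch of how the corrections fit together (not the answer's original listing), the two functions below compute the sum of a[i] * 20 with scalar and with SSE2 arithmetic. The names mult_scalar, mult_simd, and FACTOR are hypothetical, and the scalar loop body is assumed from the constant 20 in the question. The sketch uses _mm_add_epi16 instead of +=, since the + operator on __m128i is a compiler extension (GCC and Clang treat the type as a vector of 64-bit integers there).

```c
#include <stddef.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

#define FACTOR 20       /* constant from the question; the macro name is hypothetical */

/* Scalar reference: sum of a[i] * FACTOR (loop body assumed, see lead-in). */
static int mult_scalar(size_t len, const short *a)
{
    int result = 0;
    for (size_t i = 0; i < len; ++i)
        result += a[i] * FACTOR;
    return result;
}

/* SIMD version: eight shorts per iteration, then a scalar tail. */
static int mult_simd(size_t len, const short *a)
{
    const __m128i factor = _mm_set1_epi16(FACTOR);
    __m128i s = _mm_setzero_si128();
    const size_t vec_end = len / 8 * 8;   /* largest multiple of 8 not above len */

    for (size_t i = 0; i < vec_end; i += 8) {
        __m128i vec = _mm_loadu_si128((const __m128i *)(a + i));
        /* 16-bit multiply and 16-bit add in each lane; lanes can wrap. */
        s = _mm_add_epi16(s, _mm_mullo_epi16(vec, factor));
    }

    /* Horizontal sum of the eight 16-bit lanes. */
    short val[8];
    _mm_storeu_si128((__m128i *)val, s);
    int res = 0;
    for (int k = 0; k < 8; ++k)
        res += val[k];

    /* Remaining len % 8 elements. */
    for (size_t i = vec_end; i < len; ++i)
        res += a[i] * FACTOR;
    return res;
}
```

For an array holding 1 to 101, both functions return 103020. The largest per-lane partial sum in that case is 12480, which fits in a short. With larger inputs the 16-bit lanes would wrap, and the accumulation would have to be widened to 32 bits.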
