Using SSE on an array whose number of elements is not a multiple of four

Question

Hi everyone.

My question is: suppose I have three arrays as follows

float a[7] = {1.0, 2.0, 3.0, 4.0, 
              5.0, 6.0, 7.0};
float b[7] = {2.0, 2.0, 2.0, 2.0,
              2.0, 2.0, 2.0};
float c[7] = {0.0, 0.0, 0.0, 0.0,
              0.0, 0.0, 0.0};

And I want to perform an element-wise multiplication as follows

c[i] = a[i] * b[i], i = 0, 1, ..., 6

For the first four elements, I can use SSE intrinsics as follows

__m128* sse_a = (__m128*) &a[0];
__m128* sse_b = (__m128*) &b[0];
__m128* sse_c = (__m128*) &c[0];

*sse_c = _mm_mul_ps(*sse_a, *sse_b);

And the content in c will be

c[0] = 2.0, c[1] = 4.0, c[2] = 6.0, c[3] = 8.0
c[4] = 0.0, c[5] = 0.0, c[6] = 0.0

For the remaining three numbers at indices 4, 5, and 6, I use the following code to perform the element-wise multiplication

sse_a = (__m128*) &a[4];
sse_b = (__m128*) &b[4];
sse_c = (__m128*) &c[4];

float mask[4] = {1.0, 1.0, 1.0, 0.0};
__m128* sse_mask = (__m128*) &mask[0];

*sse_c = _mm_add_ps( *sse_c, 
    _mm_mul_ps( _mm_mul_ps(*sse_a, *sse_b), *sse_mask ) );

And the content in c[4-6] will be

c[4] = 10.0, c[5] = 12.0, c[6] = 14.0, which is the expected result.

_mm_add_ps() adds four floating-point numbers in parallel. The first, second, and third numbers correspond to indices 4, 5, and 6 of arrays a, b, and c respectively, but the fourth number lies outside the arrays. To avoid an invalid memory access, I multiply by sse_mask so that the fourth number is zero before adding the result back to sse_c (array c).

But I'm wondering whether this is safe.

Many thanks.

Loufylouf · Accepted Answer

You seem to have the mathematical operations right, but I'm really not sure that using casts the way you do is the right way to load and store data in __m128 variables.

Loading and storing

To load data from an array into a __m128 variable, you should use either __m128 _mm_load_ps (float const* mem_addr) or __m128 _mm_loadu_ps (float const* mem_addr). It is pretty easy to figure out what's what here, but a few clarifications:

For operations involving memory access, you usually have two functions doing the same thing, for example load and loadu. The first requires your memory to be aligned on a 16-byte boundary, while the u version does not have this requirement. If you don't know about memory alignment, use the u versions.
You also have load_ps and load_pd. The difference: the s stands for single, as in single precision (good old float), and the d stands for double, as in double precision. Of course, you can only put two doubles in a 128-bit variable (of type __m128d), but four floats in a __m128.
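As a minimal sketch of the two points above (unaligned access, and four floats versus two doubles per register), the following doubles a few values in place. The function names double4_floats and double2_doubles are made up for illustration; only the intrinsics come from the SSE/SSE2 headers.

```c
#include <xmmintrin.h>  /* SSE:  __m128,  _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps */
#include <emmintrin.h>  /* SSE2: __m128d, _mm_loadu_pd, _mm_add_pd, _mm_storeu_pd */

/* Hypothetical helper: doubles 4 floats. The u variants accept any address. */
static void double4_floats(const float *src, float *dst)
{
    __m128 v = _mm_loadu_ps(src);           /* 4 single-precision lanes */
    _mm_storeu_ps(dst, _mm_add_ps(v, v));
}

/* Hypothetical helper: doubles 2 doubles. Same register width, half the lanes. */
static void double2_doubles(const double *src, double *dst)
{
    __m128d v = _mm_loadu_pd(src);          /* 2 double-precision lanes */
    _mm_storeu_pd(dst, _mm_add_pd(v, v));
}
```

Calling double4_floats(f + 1, out) on a float array f is fine here even though f + 1 is generally not 16-byte aligned; with _mm_load_ps that same call would require an aligned address.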

So loading data from an array is pretty easy, just do: __m128 sse_a = _mm_loadu_ps(&a[0]);. Do the same for b, but for c it really depends. If you only want the result of the multiplication in it, there is no point in initializing it to 0, loading it, adding the result of the multiplication to it, and then storing it back.

To store data, use the counterpart of load, which is void _mm_storeu_ps (float* mem_addr, __m128 a). So once the multiplication is done and the result is in sse_c, just do _mm_storeu_ps(&c[0], sse_c);
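Put together, the load, multiply, and store steps for one group of four floats might look like the sketch below. The helper name mul4 is an assumption for illustration; it simply wraps the three intrinsics named above.

```c
#include <xmmintrin.h>  /* SSE: __m128, _mm_loadu_ps, _mm_mul_ps, _mm_storeu_ps */

/* Hypothetical helper: c[0..3] = a[0..3] * b[0..3].
   No alignment is required because the unaligned (u) variants are used. */
static void mul4(const float *a, const float *b, float *c)
{
    __m128 sse_a = _mm_loadu_ps(a);          /* load 4 floats from a */
    __m128 sse_b = _mm_loadu_ps(b);          /* load 4 floats from b */
    __m128 sse_c = _mm_mul_ps(sse_a, sse_b); /* 4 products at once   */
    _mm_storeu_ps(c, sse_c);                 /* write them to c      */
}
```

With the arrays from the question, mul4(&a[0], &b[0], &c[0]) fills c[0] to c[3] and leaves c[4] to c[6] untouched.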

Algorithm

The idea behind using the mask is good, but there is something easier: load and store data starting from a[3] (same for b and c). That way the vector covers 4 valid elements (indices 3 to 6), so there is no need for any mask. Yes, the operation has already been done on the element at index 3, but that is completely transparent: the store just replaces the old value with the new one. Since both are equal, that's not a problem.
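A sketch of this overlapping approach, generalized to any length of at least 4, could look as follows. The function name mul_overlap_tail is hypothetical, and the code assumes that c does not overlap a or b (with in-place output the recomputed element would be multiplied twice). Note that a four-float load starting at index 4 of a 7-element array reads one element past the end whether or not a mask is applied afterwards, which is what the overlap avoids.

```c
#include <stddef.h>
#include <xmmintrin.h>

/* Hypothetical helper: c[i] = a[i] * b[i] for i in [0, n).
   Assumes n >= 4 and that c does not alias a or b. */
static void mul_overlap_tail(const float *a, const float *b, float *c, size_t n)
{
    size_t i = 0;

    /* Full groups of four. */
    for (; i + 4 <= n; i += 4)
        _mm_storeu_ps(c + i, _mm_mul_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i)));

    /* Leftover elements: redo the last four, overlapping the previous group.
       For n = 7 this starts at index 3 and recomputes c[3] with the same value. */
    if (i < n) {
        size_t last = n - 4;
        _mm_storeu_ps(c + last,
                      _mm_mul_ps(_mm_loadu_ps(a + last), _mm_loadu_ps(b + last)));
    }
}
```

Every load and store stays inside the arrays, so no mask and no extra allocation are needed.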

One alternative is to store 8 elements in each array even if you only need 7. That way you don't have to worry about whether the memory is allocated or not, and there is no need for special logic like the above, at the cost of 3 extra floats (one per array), which is nothing on any recent computer.
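A minimal sketch of the padding alternative, assuming the caller declares all three arrays with 8 elements. The names LEN, PADDED_LEN, and mul_padded are illustrative only.

```c
#include <xmmintrin.h>

enum { LEN = 7, PADDED_LEN = 8 };  /* 7 rounded up to the next multiple of 4 */

/* Hypothetical helper: multiplies all PADDED_LEN elements, padding included.
   Every array passed in must really hold PADDED_LEN floats. */
static void mul_padded(const float *a, const float *b, float *c)
{
    for (int i = 0; i < PADDED_LEN; i += 4)
        _mm_storeu_ps(c + i, _mm_mul_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i)));
}
```

If the arrays are declared as float a[PADDED_LEN] = {1.0, ..., 7.0}, the unused last element is zero-initialized, so the padding lane just computes a harmless 0 * 0 and only the first LEN results are meaningful.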
