Ricky

Reputation: 1683

SSE instruction within nested for loops

I have several nested for loops in my code, and I am trying to use Intel SSE instructions on an Intel Core i7 to speed up the application. The code structure is as follows (val is set in a higher for loop):

__m128 in1,in2,tmp1,tmp2,out;
float arr[4] __attribute__ ((aligned(16)));
val = ...;

... several higher for loops ...
for(f=0; f<=fend; f=f+4){
    index2 = ...;
    for(i=0; i<iend; i++){
        for(j=0; j<jend; j++){
            inputval = ...;
            index = ...;
            if(f<=fend-4){
                arr[0] = array[index];
                arr[1] = array[index+val];
                arr[2] = array[index+2*val];
                arr[3] = array[index+3*val];
                in1  = _mm_load_ps(arr);
                in2  = _mm_set_ps1(inputval);
                tmp1 = _mm_mul_ps(in1, in2);
                tmp2 = _mm_loadu_ps(&array2[index2]);
                out  = _mm_add_ps(tmp1,tmp2);
                _mm_storeu_ps(&array2[index2], out);
            } else {
                //if no 4 values available for SSE instruction execution execute serial code
                for(int u = 0; u < fend-f; u++ ) array2[index2+u] += array[index+u*val] * inputval;
            }
        }
    }
}

I think there are two main problems: the buffer used for aligning the values from 'array', and the fact that fewer than 4 values may be left at the end (e.g. when fend = 6, two values are left over, which have to be processed by the sequential code). Is there any other way of loading the values into in1 and/or executing SSE instructions on only 2 or 3 values?


Thanks for the answers so far. The loading is as good as it gets, I think, but is there any workaround for the 'leftover' part in the else branch that uses SSE instructions?

Upvotes: 0

Views: 757

Answers (3)

Walter

Reputation: 45414

If you want the full benefit from SSE (a factor of 4 or more over the best-optimised code without explicit use of SSE), you must ensure that your data layout is such that you only ever need aligned loads and stores. Though using _mm_set_ps(w,z,y,x) in your code snippet may help, you should avoid the need for it, i.e. avoid strided accesses (they are less efficient than a single _mm_load_ps).

As for the problem of the last few (<4) elements, I usually ensure that all my data are not only 16-byte aligned, but that array sizes are also multiples of 16 bytes, so that I never have such spare remaining elements. Of course, the real problem may have spare elements, but those can usually be set so that they don't cause a problem (set to the neutral element, i.e. zero for additive operations). In rare cases, you only want to work on a subset of the array which starts and/or ends at an unaligned position. In this case one may use bitwise operations (_mm_and_ps, _mm_or_ps) to suppress operations on the unwanted elements.
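The padding and masking ideas above can be sketched roughly as follows. This is a minimal illustration, not code from the question: the name scale_add_masked and the tail_mask_bits table are hypothetical, and it assumes contiguous, 16-byte-aligned buffers whose storage is padded to a multiple of 4 floats so that the last full-width load and store stay inside the allocation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE: __m128, _mm_* */

/* Row r keeps the first r lanes (all bits set) and clears the others. */
static const uint32_t tail_mask_bits[4][4] __attribute__((aligned(16))) = {
    {0, 0, 0, 0},
    {0xFFFFFFFFu, 0, 0, 0},
    {0xFFFFFFFFu, 0xFFFFFFFFu, 0, 0},
    {0xFFFFFFFFu, 0xFFFFFFFFu, 0xFFFFFFFFu, 0},
};

/* dst[k] += src[k] * scale for k in [0, n).
 * Assumption: dst and src are 16-byte aligned and padded to a multiple of
 * 4 floats (hypothetical layout, not the one in the question). */
static void scale_add_masked(float *dst, const float *src, float scale, size_t n)
{
    assert(((uintptr_t)dst % 16) == 0 && ((uintptr_t)src % 16) == 0);
    const __m128 s = _mm_set1_ps(scale);
    size_t k = 0;

    /* Full groups of 4 floats: aligned loads and stores only. */
    for (; k + 4 <= n; k += 4) {
        __m128 prod = _mm_mul_ps(_mm_load_ps(src + k), s);
        _mm_store_ps(dst + k, _mm_add_ps(_mm_load_ps(dst + k), prod));
    }

    /* 1 to 3 leftover floats: compute all 4 lanes, then zero the product in
     * the lanes past n so that adding it leaves the padding unchanged. */
    if (k < n) {
        __m128 mask = _mm_load_ps((const float *)tail_mask_bits[n - k]);
        __m128 prod = _mm_mul_ps(_mm_load_ps(src + k), s);
        prod = _mm_and_ps(prod, mask);
        _mm_store_ps(dst + k, _mm_add_ps(_mm_load_ps(dst + k), prod));
    }
}
```

Calling scale_add_masked(dst, src, 2.0f, 6) on two padded 8-float buffers would update the first six elements and leave the last two as they were. If the padding in src is kept at zero, the mask is not even needed, which is the "neutral element" variant described above.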

Upvotes: 0

Mysticial

Reputation: 471209

I think the bigger problem is that there is so little computation for such a massive amount of data movement:

arr[0] = array[index];                   //  Data Movement
arr[1] = array[index+val];               //  Data Movement
arr[2] = array[index+2*val];             //  Data Movement
arr[3] = array[index+3*val];             //  Data Movement
in1  = _mm_load_ps(arr);                 //  Data Movement
in2  = _mm_set_ps1(inputval);            //  Data Movement
tmp1 = _mm_mul_ps(in1, in2);             //  Computation
tmp2 = _mm_loadu_ps(&array2[index2]);    //  Data Movement
out  = _mm_add_ps(tmp1,tmp2);            //  Computation
_mm_storeu_ps(&array2[index2], out);     //  Data Movement

While it might be possible to simplify this, I'm not at all convinced that vectorization is going to be beneficial in this situation.

You'll have to change your data layout to avoid the strided access index + n*val.

Or you can wait until AVX2 gather instructions become available in 2013?
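A rough sketch of what such a layout change could look like, under assumptions the question does not confirm: the helper names make_f_contiguous and scale_add4, and the idea that every element can be addressed as src[pos + f*stride], are hypothetical. The one-off copy only pays for itself if the reorganised data is reused many times.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE: __m128, _mm_* */

/* Copy from a layout where consecutive f values are `stride` floats apart
 * (src[pos + f*stride]) to one where they are adjacent (dst[pos*nf + f]). */
static void make_f_contiguous(float *dst, const float *src,
                              size_t npos, size_t nf, size_t stride)
{
    assert(dst != src);  /* this sketch is not an in-place transpose */
    for (size_t pos = 0; pos < npos; ++pos)
        for (size_t f = 0; f < nf; ++f)
            dst[pos * nf + f] = src[pos + f * stride];
}

/* With the reorganised layout, the four inputs come from a single load:
 * out[0..3] += in_t[0..3] * inputval. Unaligned loads and stores are used
 * here because no alignment is assumed for either pointer. */
static void scale_add4(float *out, const float *in_t, float inputval)
{
    __m128 in1 = _mm_loadu_ps(in_t);
    __m128 acc = _mm_loadu_ps(out);
    __m128 res = _mm_add_ps(acc, _mm_mul_ps(in1, _mm_set1_ps(inputval)));
    _mm_storeu_ps(out, res);
}
```

If the reorganised rows are additionally padded so that each one starts on a 16-byte boundary, the unaligned loads and stores can become _mm_load_ps and _mm_store_ps.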

Upvotes: 2

Paul R

Reputation: 212929

You can express this:

            arr[0] = array[index];
            arr[1] = array[index+val];
            arr[2] = array[index+2*val];
            arr[3] = array[index+3*val];
            in1  = _mm_load_ps(arr);

more succinctly as:

            in1  = _mm_set_ps(array[index+3*val], array[index+2*val], array[index+val], array[index]);

and get rid of arr, which might give the compiler some opportunity to optimise away some redundant loads/stores.

However, your data organisation is the main problem, compounded by the fact that you are doing almost no computation relative to the number of loads and stores, two of which are unaligned. If possible, you need to re-organise your data structures so that you can load and store 4 elements at a time from aligned contiguous memory in all cases, otherwise any computational benefits will tend to be outweighed by inefficient memory access patterns.
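For reference, if the layout cannot be changed, the _mm_set_ps form can be wrapped up roughly like this. It is only a sketch: the function name strided_scale_add and its parameter list are assumptions made for illustration, and it folds the question's f loop and scalar fallback into one helper.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE: __m128, _mm_* */

/* array2[index2 + u] += array[index + u*val] * inputval for u in [0, count).
 * Hypothetical helper; the index computations of the question are elided
 * there and are simply passed in here. */
static void strided_scale_add(float *array2, int index2,
                              const float *array, int index, int val,
                              float inputval, int count)
{
    assert(count >= 0);
    const __m128 in2 = _mm_set1_ps(inputval);
    int u = 0;

    /* Groups of 4: gather the strided inputs without a temporary buffer. */
    for (; u + 4 <= count; u += 4) {
        const float *p = array + index + u * val;
        /* _mm_set_ps lists lanes from highest to lowest. */
        __m128 in1 = _mm_set_ps(p[3 * val], p[2 * val], p[val], p[0]);
        __m128 acc = _mm_loadu_ps(array2 + index2 + u);
        _mm_storeu_ps(array2 + index2 + u,
                      _mm_add_ps(_mm_mul_ps(in1, in2), acc));
    }

    /* Fewer than 4 values left: plain scalar code. */
    for (; u < count; ++u)
        array2[index2 + u] += array[index + u * val] * inputval;
}
```

The vector loop runs while at least 4 values remain, so a count that is an exact multiple of 4 never reaches the scalar tail.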

Upvotes: 1
