sney-js
sney-js

Reputation: 1097

How can I write SSE instruction in C for two for loops?

Using the library 'immintrin.h', I am able to write SSE instruction for simple for loops and operations. However, how can I write SSE instructions for the shown statement?

for (int i =0; i<n; i++){
   for (int j=0; j<n; j++) {
     x[i] += a[i] + a[j];
}}

x and a are float* initialised using _mm_malloc(). memory access pattern can be used as __m128 and an unrolling strategy for 4 bytes.

I'm sorry if I'm not too clear, but just like

for (int i = 0; i < vecsize; i+=4) {
        __m128 a  = _mm_load_ps(a+i);
        __m128 x  = _mm_add_ps(x,a);
        _mm_store_ps(x+i, x);
    }

(which is for 1 loop only), I'd like something similar for the loops shown above.

Edit: I (EricPostpischil) am injecting this text from a comment, because it is important to the problem statement. The author, NeilDA, should expand upon this:

… in my program 'a' is always changing and hence I want 'x' that changes with it.

I HAVE MANAGED TO DO IT!! I submitted the answer..

Upvotes: 1

Views: 724

Answers (4)

scottt
scottt

Reputation: 7228

Transform:

    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            x[i] += a[i] + a[j];

into:

    for (i = 0; i < n; i++)
        x[i] += n*a[i] + sum(a));

See f_sse() in the code below:

#include <stdio.h>
#include <string.h>
#include <immintrin.h>

enum {
    N = 4,
};

float x[N], a[N] = { .1, .2, .3, .4 }, y[N];

void f(float *x, float *a, int n)
{
    int i, j;
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            x[i] += a[i] + a[j];
}

float array_sum(float *a, int n)
{
    /* could be vectorized as well */
    int i;
    float s;
    for (s = 0, i = 0; i < n; i++)
        s += a[i];
    return s;
}

void f_sse(float *x, float *a, int n)
{
    int i, l;
    float t;
    __m128 sum_a, n_vec;
    t = array_sum(a, n);
    sum_a = _mm_set1_ps(t);
    n_vec = _mm_set1_ps(n);
    l = n / 4;
    for (i = 0; i < l; i += 4) {
        __m128 ai, xi;
        ai = _mm_load_ps(a + i);
        xi = _mm_load_ps(x + i);
        ai = _mm_mul_ps(ai, n_vec);
        ai = _mm_add_ps(ai, sum_a);
        xi = _mm_add_ps(xi, ai);
        _mm_store_ps(x + i, xi);
    }
}

int main()
{
    int i, r;
    f(x, a, N);
    f_sse(y, a, N);
    r = memcmp(x, y, N);
    if (r == 0)
        return 0;
    printf("x: { ");
    for (i = 0; i < N; i++)
        printf("%f, ", x[i]);
    printf("}\n");
    printf("y: { ");
    for (i = 0; i < N; i++)
        printf("%f, ", y[i]);
    printf("}\n");
    return 3;
}

Since you state that a is being updated concurrently and you can't do the loop transform above. You need to make a conscious decision on when to take updates from a. It'll never match the original non-vectorized version since we're loading 4 floats at a time.

Re-loads a[j] in inner loop:

void f_sse(float *x, float *a, int n)
{
    int i, j, l;
    l = n / 4;
    for (i = 0; i < l; i += 4) {
        __m128 ai, xi;
        ai = _mm_load_ps(a + i);
        xi = _mm_load_ps(x + i);
        for (j = 0; j < n; j++) {
            /* re-load a[j] to get updates */
            xi = _mm_add_ps(xi, _mm_add_ps(ai, _mm_set1_ps(((volatile float *)a)[j])));
        }
        _mm_store_ps(x + i, xi);
    }

}

Re-loads a[j] and (a + i) in inner loop:

void f_sse(float *x, float *a, int n)
{
    int i, j, l;
    l = n / 4;
    for (i = 0; i < l; i += 4) {
        __m128 ai, xi;
        xi = _mm_load_ps(x + i);
        for (j = 0; j < n; j++) {
            /* re-load (a + i) to get updates */
            ai = _mm_load_ps(a + i);
            /* re-load a[j] to get updates */
            xi = _mm_add_ps(xi, _mm_add_ps(ai, _mm_set1_ps(((volatile float *)a)[j])));
        }
        _mm_store_ps(x + i, xi);
    }
}

Upvotes: 0

sney-js
sney-js

Reputation: 1097

__m128 x_v, a_v, aj_v;
for (int i = 0; i < (vec_size); i+=4) {
            x_v = _mm_load_ps(x + i);
            a_v = _mm_load_ps(a+i);
        for (int j = 0; j < N; j++) {
            aj_v = _mm_set1_ps(a[j]);
            x_v = _mm_add_ps(x_v, _mm_add_ps(aj_v, a_v));
        }
            _mm_store_ps(x + i, x_v);
    }

I don't know if it can be improved even further, but this cut down my time from 0.17 s to 0.04s :D Any remarks or ways to improve it would be great!

Upvotes: 0

Chris
Chris

Reputation: 556

__m128 *mx = (__m128*)x;
__m128 *ma = (__m128*)a;
__m128 temp_a;

for (int i = 0; i < (n>>2); ++i) {
    for (int j = 0; j < (n>>2); ++j) {
        temp_a = _mm_add_ps(*(ma+i), *(ma+j));
        *mx = _mm_add_ps(*mx, temp_a);
    }
    mx++;
}

You would have to make sure that n is a multiple of 4 of course, and make sure that x is initialized to 0 so that the accumulation is correct.

Upvotes: 0

Eric Postpischil
Eric Postpischil

Reputation: 222272

This is only a partial answer, but it is too long/detailed for a comment.

I question whether your problem is written correctly. As shown, for each x[i], it adds a[i] n times and adds each a[j] once, for 0≤j<n. So it is equivalent to:

sum = 0;
for (j = 0; j < n; ++j)
    sum += a[j];
for (i = 0; i < n; ++i)
    x[i] += n*a[i] + sum;

This would be implemented with much simpler SSE code than other likely array operations. And, of course, simply rewriting it as above would produce much faster code than the original formulation.

Upvotes: 3

Related Questions