Reputation: 369

SSE intrinsics without compiler optimization

I am new to SSE intrinsics and am trying to use them to optimize my code. Here is my program, which counts the array elements that are equal to a given value.

I converted my code to an SSE version, but the speed barely changed. I am wondering whether I am using SSE the wrong way.

This code is for an assignment where we're not allowed to enable compiler optimization options.

No SSE version:

int get_freq(const float* matrix, float value) {

    int freq = 0;

    for (ssize_t i = start; i < end; i++) {
        if (fabsf(matrix[i] - value) <= FLT_EPSILON) {
            freq++;
        }
    }

    return freq;
}

SSE version:

#include <immintrin.h>
#include <math.h>
#include <float.h>

#define GETLOAD(n) __m128 load##n = _mm_load_ps(&matrix[i + 4 * n])
#define GETEQU(n) __m128 check##n = _mm_and_ps(_mm_cmpeq_ps(load##n, value), and_value)
#define GETCOUNT(n) count = _mm_add_ps(count, check##n)

    int get_freq(const float* matrix, float givenValue, ssize_t g_elements) {

        int freq = 0;
        int i;

        __m128 value = _mm_set1_ps(givenValue);
        __m128 count = _mm_setzero_ps();
        __m128 and_value = _mm_set1_ps(1.0f);  /* 1.0f in every lane, masked by the compare result */


        for (i = 0; i + 15 < g_elements; i += 16) {
            GETLOAD(0); GETLOAD(1); GETLOAD(2); GETLOAD(3);
            GETEQU(0);  GETEQU(1);  GETEQU(2);  GETEQU(3);
            GETCOUNT(0);GETCOUNT(1);GETCOUNT(2);GETCOUNT(3);
        }

        __m128 shuffle_a = _mm_shuffle_ps(count, count, _MM_SHUFFLE(1, 0, 3, 2));
        count = _mm_add_ps(count, shuffle_a);
        __m128 shuffle_b = _mm_shuffle_ps(count, count, _MM_SHUFFLE(2, 3, 0, 1));
        count = _mm_add_ps(count, shuffle_b);
        freq = _mm_cvtss_si32(count);


        for (; i < g_elements; i++) {
            if (fabsf(matrix[i] - givenValue) <= FLT_EPSILON) {
                freq++;
            }
        }

        return freq;
    }

Upvotes: 1

Answers (2)

Peter Cordes

Reputation: 365791

If you need to compile with -O0, then do as much as possible in a single statement. In normal code, int a=foo(); bar(a); will compile to the same asm as bar(foo()), but in -O0 code, the second version will probably be faster, because it doesn't store the result to memory and then reload it for the next statement.
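As a minimal sketch of that point, the two functions below compute the same count. The function names and the `size_t n` length parameter are hypothetical, chosen only for illustration. The first spreads the work over named temporaries, each of which a typical -O0 build stores to the stack and reloads; the second folds everything into one expression so intermediates can stay in registers. How much this helps depends on the compiler and CPU, so measure rather than assume.

```c
#include <float.h>
#include <math.h>
#include <stddef.h>

/* Hypothetical name. One statement per step: at -O0 every named
 * temporary (diff, mag, hit) is typically spilled and reloaded. */
int count_close_stepwise(const float *a, size_t n, float value) {
    int freq = 0;
    for (size_t i = 0; i < n; i++) {
        float diff = a[i] - value;
        float mag = fabsf(diff);
        int hit = mag <= FLT_EPSILON;
        freq += hit;
    }
    return freq;
}

/* Hypothetical name. Same logic in a single expression: the
 * intermediate values have no names, so they need no stack slots. */
int count_close_single_expr(const float *a, size_t n, float value) {
    int freq = 0;
    for (size_t i = 0; i < n; i++) {
        freq += fabsf(a[i] - value) <= FLT_EPSILON;
    }
    return freq;
}
```

Both versions should compile to essentially the same asm once optimization is enabled; the difference is only expected to show up at -O0.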

-O0 is designed to give the most predictable results when debugging, which is why every variable is stored to memory after every statement. This is obviously horrible for performance.

I wrote a big answer a while ago for a different question from someone else with a stupid assignment like yours that required them to optimize for -O0. Some of that may help.

Don't try too hard on this assignment. Probably most of the "tricks" that you figure out that make your code run faster with -O0 will only matter for -O0, but make no difference with optimization enabled.

In real life, code is typically compiled with clang or gcc -O2 at least, and sometimes -O3 -march=haswell or whatever to auto-vectorize. (Once it's debugged and you're ready to optimize.)

Re: your update:

Now it compiles, and the horrible asm from the SSE version can be seen. I put it on godbolt along with a version of the scalar code that actually compiles, too. Intrinsics usually compile very badly with optimization disabled, with the inline functions still having args and return values that result in actual load/store round trips (store-forwarding latency) even with __attribute__((always_inline)). See Demonstrator code failing to show 4 times faster SIMD speed with optimization disabled for example.

The scalar version comes out a lot less bad. Its source does everything in one expression, so temporaries stay in registers. The loop counter is still in memory, though, bottlenecking it to at best one iteration per 6 cycles on Haswell, for example. (See the x86 tag wiki for optimization resources.)

BTW, a vectorized fabsf() is easy, see Fastest way to compute absolute value using SSE. That and an SSE compare for less-than-or-equal should do the trick to give you the same semantics as your scalar code. (But that makes it even harder to get -O0 to not suck.)
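A rough sketch of that idea follows, assuming an x86 target with SSE2. The function name, the `size_t n` parameter, the use of unaligned loads, and the integer accumulation of the compare mask are all choices made for this illustration, not something taken from the question. The absolute value is computed by clearing the sign bit with `_mm_andnot_ps`, and `_mm_cmple_ps` mirrors the scalar `<=` test.

```c
#include <float.h>
#include <immintrin.h>
#include <math.h>
#include <stddef.h>

/* Hypothetical name: counts elements within FLT_EPSILON of value. */
int count_close_sse(const float *a, size_t n, float value) {
    const __m128 vvalue = _mm_set1_ps(value);
    const __m128 veps = _mm_set1_ps(FLT_EPSILON);
    const __m128 sign_mask = _mm_set1_ps(-0.0f); /* only the sign bit set */
    __m128i counts = _mm_setzero_si128();
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        /* A matching lane compares to all-ones, i.e. integer -1,
         * so subtracting the mask adds 1 to that lane's counter. */
        counts = _mm_sub_epi32(counts, _mm_castps_si128(_mm_cmple_ps(
            _mm_andnot_ps(sign_mask, _mm_sub_ps(_mm_loadu_ps(a + i), vvalue)),
            veps)));
    }

    /* Horizontal sum of the four 32-bit lane counters. */
    counts = _mm_add_epi32(counts,
                           _mm_shuffle_epi32(counts, _MM_SHUFFLE(1, 0, 3, 2)));
    counts = _mm_add_epi32(counts,
                           _mm_shuffle_epi32(counts, _MM_SHUFFLE(2, 3, 0, 1)));
    int freq = _mm_cvtsi128_si32(counts);

    /* Scalar tail for the last n % 4 elements. */
    for (; i < n; i++) {
        freq += fabsf(a[i] - value) <= FLT_EPSILON;
    }
    return freq;
}
```

The loop body is deliberately written as one nested expression, for the same -O0 reason as above. `_mm_loadu_ps` is used so the sketch does not depend on the array being 16-byte aligned.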

You might do better just manually unrolling your scalar version one or two times, because -O0 sucks too much.
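One way that manual unrolling might look is sketched below. The function name, the `size_t n` parameter, and the unroll factor of 2 are assumptions for illustration; the factor is worth tuning by measurement. Each iteration handles two elements, so the in-memory loop counter is updated half as often.

```c
#include <float.h>
#include <math.h>
#include <stddef.h>

/* Hypothetical name: scalar count, manually unrolled by 2. */
int count_close_unrolled(const float *a, size_t n, float value) {
    int freq = 0;
    size_t i = 0;

    /* Two elements per iteration, summed in a single expression. */
    for (; i + 2 <= n; i += 2) {
        freq += (fabsf(a[i] - value) <= FLT_EPSILON)
              + (fabsf(a[i + 1] - value) <= FLT_EPSILON);
    }

    /* At most one leftover element when n is odd. */
    for (; i < n; i++) {
        freq += fabsf(a[i] - value) <= FLT_EPSILON;
    }
    return freq;
}
```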

Upvotes: 4

MirekE

Reputation: 11555

Some compilers are pretty good at auto-vectorizing. Did you check the generated assembly of an optimized build of both versions? Isn't the "naive" version actually using SIMD or other optimization techniques?

Upvotes: 1
