Reputation: 43
Why is my SIMD vector4 length function 3x slower than a naive vector length method?
SIMD vector4 length function:
__extern_always_inline float vec4_len(const float *v) {
    __m128 vec1 = _mm_load_ps(v);
    __m128 xmm1 = _mm_mul_ps(vec1, vec1);
    __m128 xmm2 = _mm_hadd_ps(xmm1, xmm1);
    __m128 xmm3 = _mm_hadd_ps(xmm2, xmm2);
    return sqrtf(_mm_cvtss_f32(xmm3));
}
Naive implementation:
sqrtf(V[0] * V[0] + V[1] * V[1] + V[2] * V[2] + V[3] * V[3])
The SIMD version took 16110ms for 1000000000 iterations. The naive version was ~3 times faster, taking only 4746ms.
#include <math.h>
#include <time.h>
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>

static float vec4_len(const float *v) {
    __m128 vec1 = _mm_load_ps(v);
    __m128 xmm1 = _mm_mul_ps(vec1, vec1);
    __m128 xmm2 = _mm_hadd_ps(xmm1, xmm1);
    __m128 xmm3 = _mm_hadd_ps(xmm2, xmm2);
    return sqrtf(_mm_cvtss_f32(xmm3));
}

int main() {
    float A[4] __attribute__((aligned(16))) = {3, 4, 0, 0};
    struct timespec t0 = {};
    clock_gettime(CLOCK_MONOTONIC, &t0);
    double sum_len = 0;
    for (uint64_t k = 0; k < 1000000000; ++k) {
        A[3] = k;
        sum_len += vec4_len(A);
        // sum_len += sqrtf(A[0] * A[0] + A[1] * A[1] + A[2] * A[2] + A[3] * A[3]);
    }
    struct timespec t1 = {};
    clock_gettime(CLOCK_MONOTONIC, &t1);
    fprintf(stdout, "%f\n", sum_len);
    fprintf(stdout, "%ldms\n", (((t1.tv_sec - t0.tv_sec) * 1000000000) + (t1.tv_nsec - t0.tv_nsec)) / 1000000);
    return 0;
}
I run it with the following command on an Intel(R) Core(TM) i7-8550U CPU, first with the vec4_len version and then with the plain C one.
I compile with GCC (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0:
gcc -Wall -Wextra -O3 -msse -msse3 sse.c -lm && ./a.out
SSE version output:
499999999500000128.000000
13458ms
Plain C version output:
499999999500000128.000000
4441ms
Upvotes: 4
Views: 1406
Reputation: 364220
The most obvious problem is using an inefficient dot-product (with haddps, which costs 2x shuffle uops + 1x add uop) instead of shuffle + add. See Fastest way to do horizontal float vector sum on x86 for what to do after _mm_mul_ps that doesn't suck as much. But still, this is just not something x86 can do very efficiently.
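As a rough sketch (one of the shuffle + add patterns covered in that linked Q&A; hsum_ps_sse3 is just an illustrative name, not something from the question's code), a horizontal sum that avoids haddps could look like this:

// Shuffle + add horizontal sum (SSE3): 2 shuffles + 2 adds, none of them
// the 3-uop haddps.
static inline float hsum_ps_sse3(__m128 v) {
    __m128 shuf = _mm_movehdup_ps(v);        // (v1, v1, v3, v3)
    __m128 sums = _mm_add_ps(v, shuf);       // (v0+v1, _, v2+v3, _)
    shuf        = _mm_movehl_ps(shuf, sums); // bring v2+v3 down to lane 0
    sums        = _mm_add_ss(sums, shuf);    // v0+v1+v2+v3 in lane 0
    return _mm_cvtss_f32(sums);
}

vec4_len could then end with return sqrtf(hsum_ps_sse3(xmm1)); instead of the two haddps.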
But anyway, the real problem is your benchmark loop.
A[3] = k; and then using _mm_load_ps(A) creates a store-forwarding stall, if it compiles naively instead of to a vector shuffle. A store + reload can be efficiently forwarded with ~5 cycles of latency if the load only loads data from a single store instruction, and no data outside that. Otherwise it has to do a slower scan of the whole store buffer to assemble bytes. This adds about 10 cycles of latency to the store-forwarding.
I'm not sure how much impact this has on throughput, but it could be enough to stop out-of-order exec from overlapping enough loop iterations to hide the latency, and to bottleneck only on sqrtss / shuffle throughput.
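For example, a hypothetical rewrite of the benchmark's loop body (reusing the question's A and sum_len, and keeping the haddps version just to isolate the store-forwarding issue) could build the operand in registers instead of storing A[3] and then reloading all 16 bytes:

for (uint64_t k = 0; k < 1000000000; ++k) {
    // Splice (float)k into the w lane with _mm_set_ps: no narrow store
    // followed by a wider reload, so nothing for store-forwarding to choke on.
    __m128 vec  = _mm_set_ps((float)k, A[2], A[1], A[0]);
    __m128 xmm1 = _mm_mul_ps(vec, vec);
    __m128 xmm2 = _mm_hadd_ps(xmm1, xmm1);
    __m128 xmm3 = _mm_hadd_ps(xmm2, xmm2);
    sum_len += sqrtf(_mm_cvtss_f32(xmm3));
}

(Whether the compiler actually keeps that in registers is worth checking in the generated asm, but it at least removes the A[3] = k store from the source.)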
(Your Coffee Lake CPU has 1 per 3 cycle sqrtss throughput, so surprisingly SQRT throughput is not your bottleneck; see footnote 1. Instead it will be shuffle throughput or something else.)
See Agner Fog's microarch guide and/or optimization manual.
Plus, you're biasing this even more against SSE by letting the compiler hoist the computation of V[0] * V[0] + V[1] * V[1] + V[2] * V[2] out of the loop.
That part of the expression is loop-invariant, so the compiler only has to do (float)k squared, an add, and a scalar sqrt every loop iteration (and convert that result to double to add to your accumulator).
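In source terms, the scalar loop effectively becomes something like this sketch (not the actual compiler output, just the shape of the optimization):

// The first three terms don't depend on k, so they can be hoisted:
float xyz = A[0]*A[0] + A[1]*A[1] + A[2]*A[2];   // loop-invariant
for (uint64_t k = 0; k < 1000000000; ++k) {
    float w = (float)k;
    sum_len += sqrtf(xyz + w * w);   // square (float)k, add, sqrt, widen to double
}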
(@StaceyGirl's deleted answer pointed this out; looking over the code of the inner loops in it was a great start on writing this answer.)
GCC 9.1's inner loop from Kamil's Godbolt link looks terrible, and seems to include a loop-carried store/reload to merge a new A[3] into the 8-byte A[2..3] pair, further limiting the CPU's ability to overlap multiple iterations.
I'm not sure why gcc thought this was a good idea. It might help on CPUs that split vector loads into 8-byte halves (like Pentium M or Bobcat), to avoid store-forwarding stalls there. But that's not a sane tuning for "generic" modern x86-64 CPUs.
.L18:
        pxor     xmm4, xmm4
        mov      rdx, QWORD PTR [rsp+8]   ; reload A[2..3]
        cvtsi2ss xmm4, rbx
        mov      edx, edx                 ; truncate RDX to 32-bit
        movd     eax, xmm4                ; float bit-pattern of (float)k
        sal      rax, 32
        or       rdx, rax                 ; merge the float bit-pattern into A[3]
        mov      QWORD PTR [rsp+8], rdx   ; store A[2..3] again
        movaps   xmm0, XMMWORD PTR [rsp]  ; vector load: store-forwarding stall
        mulps    xmm0, xmm0
        haddps   xmm0, xmm0
        haddps   xmm0, xmm0
        ucomiss  xmm3, xmm0
        movaps   xmm1, xmm0
        sqrtss   xmm1, xmm1
        ja       .L21                     ; call sqrtf to set errno if needed; flags set by ucomiss.
.L17:
        add      rbx, 1
        cvtss2sd xmm1, xmm1
        addsd    xmm2, xmm1               ; total += (double)sqrtf
        cmp      rbx, 1000000000
        jne      .L18                     ; }while(k<1000000000);
This insanity isn't present in the scalar version.
Either way, gcc did manage to avoid the inefficiency of a full uint64_t -> float conversion (which x86 doesn't have in hardware until AVX512). It was presumably able to prove that using a signed 64-bit -> float conversion would always work because the high bit can't be set.
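In source terms, the transformation gcc can prove safe here is roughly this sketch (the loop bound guarantees k never has its top bit set):

// uint64_t -> float has no single instruction before AVX512, but a signed
// 64-bit convert (one cvtsi2ss) gives the same result whenever the high bit
// of k is clear, which the loop bound guarantees.
float kf = (float)(int64_t)k;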
Footnote 1: But sqrtps has the same 1 per 3 cycle throughput as scalar, so you're only getting 1/4 of your CPU's sqrt throughput capability by doing 1 vector at a time horizontally, instead of doing 4 lengths for 4 vectors in parallel.
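As a sketch of what that could look like (assuming the data were rearranged into an SoA layout with separate 16-byte-aligned x/y/z/w arrays; the layout and the function name are invented here, not part of the question):

// Compute 4 vector lengths per call: one sqrtps amortized over 4 results,
// and no horizontal shuffles at all.
static void vec4_len_x4(const float *x, const float *y,
                        const float *z, const float *w, float *out) {
    __m128 vx = _mm_load_ps(x), vy = _mm_load_ps(y);
    __m128 vz = _mm_load_ps(z), vw = _mm_load_ps(w);
    __m128 sq = _mm_add_ps(_mm_add_ps(_mm_mul_ps(vx, vx), _mm_mul_ps(vy, vy)),
                           _mm_add_ps(_mm_mul_ps(vz, vz), _mm_mul_ps(vw, vw)));
    _mm_store_ps(out, _mm_sqrt_ps(sq));   // 4 lengths from one sqrtps
}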
Upvotes: 8