Reputation: 43
Why is my SIMD vector4 length function 3x slower than a naive vector length method?
SIMD vector4 length function:
__extern_always_inline float vec4_len(const float *v) {
    __m128 vec1 = _mm_load_ps(v);
    __m128 xmm1 = _mm_mul_ps(vec1, vec1);
    __m128 xmm2 = _mm_hadd_ps(xmm1, xmm1);
    __m128 xmm3 = _mm_hadd_ps(xmm2, xmm2);
    return sqrtf(_mm_cvtss_f32(xmm3));
}
Naive implementation:
sqrtf(V[0] * V[0] + V[1] * V[1] + V[2] * V[2] + V[3] * V[3])
The SIMD version took 16110ms for 1000000000 iterations. The naive version was ~3 times faster, taking only 4746ms.
#include <math.h>
#include <time.h>
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>

static float vec4_len(const float *v) {
    __m128 vec1 = _mm_load_ps(v);
    __m128 xmm1 = _mm_mul_ps(vec1, vec1);
    __m128 xmm2 = _mm_hadd_ps(xmm1, xmm1);
    __m128 xmm3 = _mm_hadd_ps(xmm2, xmm2);
    return sqrtf(_mm_cvtss_f32(xmm3));
}

int main() {
    float A[4] __attribute__((aligned(16))) = {3, 4, 0, 0};
    struct timespec t0 = {};
    clock_gettime(CLOCK_MONOTONIC, &t0);
    double sum_len = 0;
    for (uint64_t k = 0; k < 1000000000; ++k) {
        A[3] = k;
        sum_len += vec4_len(A);
        // sum_len += sqrtf(A[0] * A[0] + A[1] * A[1] + A[2] * A[2] + A[3] * A[3]);
    }
    struct timespec t1 = {};
    clock_gettime(CLOCK_MONOTONIC, &t1);
    fprintf(stdout, "%f\n", sum_len);
    fprintf(stdout, "%ldms\n", (((t1.tv_sec - t0.tv_sec) * 1000000000) + (t1.tv_nsec - t0.tv_nsec)) / 1000000);
    return 0;
}
I run it with the following command on an Intel(R) Core(TM) i7-8550U CPU, first with the vec4_len version and then with the plain C one.
I compile with GCC (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0:
gcc -Wall -Wextra -O3 -msse -msse3 sse.c -lm && ./a.out
SSE version output:
499999999500000128.000000
13458ms
Plain C version output:
499999999500000128.000000
4441ms
Upvotes: 4
Views: 1406
Reputation: 364220
The most obvious problem is using an inefficient dot-product (with haddps, which costs 2x shuffle uops + 1x add uop) instead of shuffle + add. See Fastest way to do horizontal float vector sum on x86 for what to do after _mm_mul_ps that doesn't suck as much. But still, this is just not something x86 can do very efficiently.
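As a rough sketch (one of the shuffle + add patterns covered in that linked Q&A; hsum_ps_sse3 is just an illustrative name, not something from the question's code), a horizontal sum that avoids haddps could look like this:

// Shuffle + add horizontal sum (SSE3): 2 shuffles + 2 adds, none of them
// the 3-uop haddps.
static inline float hsum_ps_sse3(__m128 v) {
    __m128 shuf = _mm_movehdup_ps(v);        // (v1, v1, v3, v3)
    __m128 sums = _mm_add_ps(v, shuf);       // (v0+v1, _, v2+v3, _)
    shuf        = _mm_movehl_ps(shuf, sums); // bring v2+v3 down to lane 0
    sums        = _mm_add_ss(sums, shuf);    // v0+v1+v2+v3 in lane 0
    return _mm_cvtss_f32(sums);
}

vec4_len could then end with return sqrtf(hsum_ps_sse3(xmm1)); instead of the two haddps.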
But anyway, the real problem is your benchmark loop.
A[3] = k; and then using _mm_load_ps(A) creates a store-forwarding stall, if it compiles naively instead of to a vector shuffle. A store + reload can be efficiently forwarded with ~5 cycles of latency if the load only loads data from a single store instruction, and no data outside that. Otherwise it has to do a slower scan of the whole store buffer to assemble bytes. This adds about 10 cycles of latency to the store-forwarding.
I'm not sure how much impact this has on throughput, but it could be enough to stop out-of-order exec from overlapping enough loop iterations to hide the latency, and to bottleneck only on sqrtss / shuffle throughput.
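For example, a hypothetical rewrite of the benchmark's loop body (reusing the question's A and sum_len, and keeping the haddps version just to isolate the store-forwarding issue) could build the operand in registers instead of storing A[3] and then reloading all 16 bytes:

for (uint64_t k = 0; k < 1000000000; ++k) {
    // Splice (float)k into the w lane with _mm_set_ps: no narrow store
    // followed by a wider reload, so nothing for store-forwarding to choke on.
    __m128 vec  = _mm_set_ps((float)k, A[2], A[1], A[0]);
    __m128 xmm1 = _mm_mul_ps(vec, vec);
    __m128 xmm2 = _mm_hadd_ps(xmm1, xmm1);
    __m128 xmm3 = _mm_hadd_ps(xmm2, xmm2);
    sum_len += sqrtf(_mm_cvtss_f32(xmm3));
}

(Whether the compiler actually keeps that in registers is worth checking in the generated asm, but it at least removes the A[3] = k store from the source.)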
(Your Coffee Lake CPU has 1 per 3 cycle sqrtss throughput, so surprisingly SQRT throughput is not your bottleneck; see footnote 1. Instead it will be shuffle throughput or something else.)
See Agner Fog's microarch guide and/or optimization manual.
Plus, you're biasing this even more against SSE by letting the compiler hoist the computation of V[0] * V[0] + V[1] * V[1] + V[2] * V[2] out of the loop.
That part of the expression is loop-invariant, so the compiler only has to do (float)k squared, an add, and a scalar sqrt every loop iteration (and convert that result to double to add to your accumulator).
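In source terms, the scalar loop effectively becomes something like this sketch (not the actual compiler output, just the shape of the optimization):

// The first three terms don't depend on k, so they can be hoisted:
float xyz = A[0]*A[0] + A[1]*A[1] + A[2]*A[2];   // loop-invariant
for (uint64_t k = 0; k < 1000000000; ++k) {
    float w = (float)k;
    sum_len += sqrtf(xyz + w * w);   // square (float)k, add, sqrt, widen to double
}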
(@StaceyGirl's deleted answer pointed this out; looking over the code of the inner loops in it was a great start on writing this answer.)
GCC 9.1's inner loop from Kamil's Godbolt link looks terrible, and seems to include a loop-carried store/reload to merge a new A[3] into the 8-byte A[2..3] pair, further limiting the CPU's ability to overlap multiple iterations.
I'm not sure why gcc thought this was a good idea. It might help on CPUs that split vector loads into 8-byte halves (like Pentium M or Bobcat), to avoid store-forwarding stalls there. But that's not a sane tuning for "generic" modern x86-64 CPUs.
.L18:
        pxor     xmm4, xmm4
        mov      rdx, QWORD PTR [rsp+8]   ; reload A[2..3]
        cvtsi2ss xmm4, rbx
        mov      edx, edx                 ; truncate RDX to 32-bit
        movd     eax, xmm4                ; float bit-pattern of (float)k
        sal      rax, 32
        or       rdx, rax                 ; merge the float bit-pattern into A[3]
        mov      QWORD PTR [rsp+8], rdx   ; store A[2..3] again
        movaps   xmm0, XMMWORD PTR [rsp]  ; vector load: store-forwarding stall
        mulps    xmm0, xmm0
        haddps   xmm0, xmm0
        haddps   xmm0, xmm0
        ucomiss  xmm3, xmm0
        movaps   xmm1, xmm0
        sqrtss   xmm1, xmm1
        ja       .L21                     ; call sqrtf to set errno if needed; flags set by ucomiss.
.L17:
        add      rbx, 1
        cvtss2sd xmm1, xmm1
        addsd    xmm2, xmm1               ; total += (double)sqrtf
        cmp      rbx, 1000000000
        jne      .L18                     ; }while(k<1000000000);
This insanity isn't present in the scalar version.
Either way, gcc did manage to avoid the inefficiency of a full uint64_t -> float conversion (which x86 doesn't have in hardware until AVX512). It was presumably able to prove that using a signed 64-bit -> float conversion would always work because the high bit can't be set.
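In source terms, the transformation gcc can prove safe here is roughly this sketch (the loop bound guarantees k never has its top bit set):

// uint64_t -> float has no single instruction before AVX512, but a signed
// 64-bit convert (one cvtsi2ss) gives the same result whenever the high bit
// of k is clear, which the loop bound guarantees.
float kf = (float)(int64_t)k;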
Footnote 1: But sqrtps has the same 1 per 3 cycle throughput as scalar, so you're only getting 1/4 of your CPU's sqrt throughput capability by doing 1 vector at a time horizontally, instead of doing 4 lengths for 4 vectors in parallel.
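As a sketch of what that could look like (assuming the data were rearranged into an SoA layout with separate 16-byte-aligned x/y/z/w arrays; the layout and the function name are invented here, not part of the question):

// Compute 4 vector lengths per call: one sqrtps amortized over 4 results,
// and no horizontal shuffles at all.
static void vec4_len_x4(const float *x, const float *y,
                        const float *z, const float *w, float *out) {
    __m128 vx = _mm_load_ps(x), vy = _mm_load_ps(y);
    __m128 vz = _mm_load_ps(z), vw = _mm_load_ps(w);
    __m128 sq = _mm_add_ps(_mm_add_ps(_mm_mul_ps(vx, vx), _mm_mul_ps(vy, vy)),
                           _mm_add_ps(_mm_mul_ps(vz, vz), _mm_mul_ps(vw, vw)));
    _mm_store_ps(out, _mm_sqrt_ps(sq));   // 4 lengths from one sqrtps
}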
Upvotes: 8