Reputation: 5363
Here's the test program that is giving me the grief:
#include <xmmintrin.h>
#include <stdio.h>
inline float
_mm_hadd_ps(const __m128 v)
{
const __m128
x = _mm_add_ps(v, _mm_movehl_ps(v, v)),
xx = _mm_add_ss(x, _mm_shuffle_ps(x, x, 1));
float __attribute__((aligned(16))) s;
_mm_store_ss(&s, xx);
return s;
}
int
main(void)
{
const float __attribute__((aligned(16))) d[] = {
4.0763611794e+00, 1.1881252751e-02, 4.9195003510e+00, 0.0000000000e+00
};
const float x = _mm_hadd_ps(_mm_load_ps(d));
const float y = d[0] + d[1] + d[2] + d[3];
printf("diff: %.10f\n", x - y);
return 0;
}
I'm compiling it with the following:
gcc -Wall -msse2 -mfpmath=sse -O0 -g -ggdb sse.c
The output that I get from is:
diff: -0.0000009537
I'm aware of the issues with extended precision arithmetic, hence mfpmath=sse
. Looking at the assembly code, the serial addition is indeed done with addss
and final subtraction with subss
.
At this point I'm stumped to explain where that difference is coming from. I'd much appreciate it if anyone can shed some light on this situation.
If it makes any difference, I'm using GCC 4.3.4. (Edit: On AMD Opteron 2218 + Gentoo Linux)
Upvotes: 2
Views: 154
Reputation: 213170
FWIW both gcc 4.2 and Intel ICC 11.1 give exactly the same result. I suspect it's just a difference in accumulated rounding errors due to the different order in which the additions are performed.
Upvotes: 1