Reputation: 1508
I'm pretty new to intrinsics, and I've run into different behavior of my code with GCC 7.4 and GCC 8.3.
My code is pretty simple:
b.cpp:
#include <iostream>
#include <xmmintrin.h>

void foo(const float num, const float denom)
{
    const __v4sf num4 = {
        num,
        num,
        num,
        num,
    };
    const __v4sf denom4 = {
        denom,
        denom,
        denom,
        denom,
    };
    float res_arr[] = {0, 0, 0, 0};
    __v4sf *res = (__v4sf*)res_arr;
    *res = num4 / denom4;
    std::cout << res_arr[0] << std::endl;
    std::cout << res_arr[1] << std::endl;
    std::cout << res_arr[2] << std::endl;
    std::cout << res_arr[3] << std::endl;
}
b.cpp simply constructs two __v4sf vectors from float variables and divides one by the other.
b.h:
#ifndef B_H
#define B_H
void foo(const float num, const float denom);
#endif
a.cpp:
#include "b.h"

int main (void)
{
    const float denominator = 1.0f;
    const float numerator = 12.0f;
    foo(numerator, denominator);
    return 0;
}
Here we just call our function from b.cpp.
GCC 7.4 works fine:
g++-7 -c b.cpp -o b.o && g++-7 a.cpp b.o -o a.out && ./a.out
12
12
12
12
But something is wrong with GCC 8.3:
g++-8 -c b.cpp -o b.o && g++-8 a.cpp b.o -o a.out && ./a.out
inf
inf
inf
inf
So my question is: why do I get different results with different versions of GCC? Is it undefined behavior?
Upvotes: 6
Views: 625
Reputation: 364438
You've found a bug in gcc8 and later, which happens with/without optimization enabled. Thanks for reporting it.
With optimization enabled it's easy to see what the asm is doing because the __v4sf stuff optimizes away: it's just scalar division and printing the result 4 times. (Plus 4 calls to flush cout because you used std::endl for some reason.)
gcc7 correctly optimizes it to divss xmm0, xmm1 to do num / denom. Then it converts the result to double, because the output functions only take double, not float, and passes that to the iostream functions. (GCC7 saves the double bit-pattern in integer register r14 instead of memory, with -mtune=skylake. GCC8 and later just use memory, which probably makes more sense.)
gcc8 and later does divss xmm0, .LC0[rip] where the constant from memory is 0 (the bit-pattern for +0.0). So it's dividing num by zero, ignoring denom.
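For comparison, here is a minimal sketch of the same broadcast-and-divide written with the documented __m128 intrinsics instead of the internal __v4sf type and the pointer cast. The helper name divide_broadcast is hypothetical, not from the question, and whether this form avoids the miscompile on a particular gcc8 build is an assumption you would need to check in the generated asm.

```cpp
#include <array>
#include <xmmintrin.h>  // SSE intrinsics: _mm_set1_ps, _mm_div_ps, _mm_storeu_ps

// Hypothetical helper (not from the question): broadcast both scalars,
// divide lane by lane, and return the four quotients.
std::array<float, 4> divide_broadcast(float num, float denom)
{
    const __m128 num4 = _mm_set1_ps(num);      // {num, num, num, num}
    const __m128 denom4 = _mm_set1_ps(denom);  // {denom, denom, denom, denom}
    const __m128 quot = _mm_div_ps(num4, denom4);

    std::array<float, 4> out{};
    _mm_storeu_ps(out.data(), quot);  // unaligned store, no pointer cast needed
    return out;
}
```

Calling divide_broadcast(12.0f, 1.0f) should give 12 in every lane, which is the result gcc7 produces for the original code.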
Check it out on the Godbolt compiler explorer.
Using alignas(16) float res_arr[4]; to remove the potential under-alignment of the __v4sf *res doesn't help. (You generally don't need __attribute__((aligned(16))) anymore; C++11 introduced standard syntax for alignment.)
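As a rough illustration of the C++11 alignment syntax, the sketch below declares 16-byte-aligned storage and writes to it with an aligned store. The helpers is_aligned16 and divide_aligned are hypothetical names introduced here, and the sketch only shows that the alignment requirement is met; as noted, alignment is not what triggers the gcc8 result.

```cpp
#include <cstdint>
#include <xmmintrin.h>  // _mm_set1_ps, _mm_div_ps, _mm_store_ps

// Hypothetical helper: true when p sits on a 16-byte boundary.
bool is_aligned16(const void* p)
{
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Hypothetical helper: divide and store into alignas(16) storage.
// _mm_store_ps requires a 16-byte-aligned destination, which alignas guarantees.
float divide_aligned(float num, float denom)
{
    alignas(16) float res_arr[4] = {0, 0, 0, 0};
    _mm_store_ps(res_arr, _mm_div_ps(_mm_set1_ps(num), _mm_set1_ps(denom)));
    return res_arr[0];
}
```

With alignas(16) the standard keyword replaces the GCC-specific attribute, and the same declaration also compiles with other C++11 compilers.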
Upvotes: 4