Different intrinsics behaviour depending on GCC version

Question

I'm pretty new to intrinsics, and I ran into different behavior of my code between GCC 7.4 and GCC 8.3.

My code is pretty simple

b.cpp:

#include <xmmintrin.h>
#include <iostream>

void foo(const float num, const float denom)
{
    const __v4sf num4 = {
        num,
        num,
        num,
        num,
    };
    const __v4sf denom4 = {
        denom,
        denom,
        denom,
        denom,
    };
    float res_arr[] = {0, 0, 0, 0};

    __v4sf *res = (__v4sf*)res_arr;
    *res = num4 / denom4;
    std::cout << res_arr[0] << std::endl;
    std::cout << res_arr[1] << std::endl;
    std::cout << res_arr[2] << std::endl;
    std::cout << res_arr[3] << std::endl;
}

In b.cpp we basically just construct two __v4sf values from float variables and perform a division.

b.h:

#ifndef B_H
#define B_H

void foo(const float num, const float denom);

#endif

a.cpp:

#include "b.h"

int main (void)
{
    const float denominator = 1.0f;
    const float numerator = 12.0f;
    foo(numerator, denominator);
    return 0;
}

Here we just call our function from b.cpp

GCC 7.4 works ok:

g++-7 -c b.cpp -o b.o && g++-7 a.cpp b.o -o a.out && ./a.out
12
12
12
12

But something is wrong with GCC 8.3:

g++-8 -c b.cpp -o b.o && g++-8 a.cpp b.o -o a.out && ./a.out
inf
inf
inf
inf

So my question is: why do I get different results with different versions of GCC? Is it undefined behavior?

Peter Cordes · Accepted Answer

You've found a bug in gcc8 and later, which happens with/without optimization enabled. Thanks for reporting it.

With optimization enabled it's easy to see what the asm is doing because the __v4sf stuff optimizes away: it's just scalar division and printing the result 4 times. (Plus 4 calls to flush cout because you used std::endl for some reason.)

gcc7 correctly optimizes it to divss xmm0, xmm1 to do num / denom. Then it converts the result to double, because the output functions only take double, not float, and passes that to the iostream functions. (GCC7 saves the double bit-pattern in integer register r14 instead of memory, with -mtune=skylake. GCC8 and later just use memory, which probably makes more sense.)

gcc8 and later does divss xmm0, .LC0[rip] where the constant from memory is 0 (the bit-pattern for +0.0). So it's dividing the num by zero, ignoring denom.
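That matches the output in the question: under IEEE 754, dividing a positive finite float by +0.0 gives +infinity, which iostream prints as inf. Here is a minimal sketch of just that arithmetic, assuming IEEE 754 floats and no -ffast-math. The name divide_by_positive_zero is a hypothetical helper for illustration, not something from the question.

```cpp
#include <cmath>
#include <limits>

// Hypothetical helper (not from the question): mimics what the miscompiled
// code effectively computes, i.e. num / +0.0f with denom ignored.
float divide_by_positive_zero(float num)
{
    // Assumption: float follows IEEE 754 (true on x86-64 without -ffast-math).
    static_assert(std::numeric_limits<float>::is_iec559,
                  "this sketch assumes IEEE 754 floats");

    // volatile keeps the compiler from folding the division at compile time.
    volatile float zero = 0.0f;
    return num / zero;
}
```

Calling divide_by_positive_zero(12.0f) and testing the result with std::isinf should report infinity, the same value the GCC 8 build printed four times.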

Check it out on the Godbolt compiler explorer.

Using alignas(16) float res_arr[4]; to remove the potential under-alignment of the __v4sf *res doesn't help. (You generally don't need __attribute__((aligned(16))) anymore; C++11 introduced standard syntax for alignment.)
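For comparison, the same computation can be written with the documented SSE intrinsics and an explicit store, instead of GNU vector syntax plus a pointer cast. This is only a sketch: divide4 is a hypothetical name, and I have not checked whether this form avoids the GCC 8 miscompilation, so treat it as an illustration of the alignas(16) plus _mm_store_ps style rather than a confirmed workaround.

```cpp
#include <xmmintrin.h>  // SSE: __m128, _mm_set1_ps, _mm_div_ps, _mm_store_ps
#include <array>

// Hypothetical helper (not from the question): computes num / denom in all
// four lanes and returns the lanes as plain floats.
std::array<float, 4> divide4(float num, float denom)
{
    const __m128 num4 = _mm_set1_ps(num);      // broadcast num to 4 lanes
    const __m128 denom4 = _mm_set1_ps(denom);  // broadcast denom to 4 lanes

    // C++11 alignas gives the 16-byte alignment that _mm_store_ps requires.
    alignas(16) float res_arr[4];
    _mm_store_ps(res_arr, _mm_div_ps(num4, denom4));

    return {res_arr[0], res_arr[1], res_arr[2], res_arr[3]};
}
```

With a correct compiler, divide4(12.0f, 1.0f) holds 12 in every element. If the alignment of the destination is not guaranteed, _mm_storeu_ps is the unaligned counterpart.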
