plasmacel

Reputation: 8530

Issues with compiler-generated assembly for intrinsics

I'm using Intel SSE/AVX/FMA intrinsics to get perfectly inlined SSE/AVX instructions for some math functions.

Given the following code

#include <cmath>
#include <immintrin.h>

auto std_fma(float x, float y, float z)
{
    return std::fma(x, y, z);
}

float _fma(float x, float y, float z)
{
    _mm_store_ss(&x,
        _mm_fmadd_ss(_mm_load_ss(&x), _mm_load_ss(&y), _mm_load_ss(&z))
    );

    return x;
}

float _sqrt(float x)
{
    _mm_store_ss(&x,
        _mm_sqrt_ss(_mm_load_ss(&x))
    );

    return x;
}

The assembly generated by clang 3.9 with -march=x86-64 -mfma -O3:

std_fma(float, float, float):                          # @std_fma(float, float, float)
        vfmadd213ss     xmm0, xmm1, xmm2
        ret

_fma(float, float, float):                             # @_fma(float, float, float)
        vxorps  xmm3, xmm3, xmm3
        vmovss  xmm0, xmm3, xmm0        # xmm0 = xmm0[0],xmm3[1,2,3]
        vmovss  xmm1, xmm3, xmm1        # xmm1 = xmm1[0],xmm3[1,2,3]
        vmovss  xmm2, xmm3, xmm2        # xmm2 = xmm2[0],xmm3[1,2,3]
        vfmadd213ss     xmm0, xmm1, xmm2
        ret

_sqrt(float):                              # @_sqrt(float)
        vsqrtss xmm0, xmm0, xmm0
        ret

While the generated code for _sqrt is fine, there are unnecessary vxorps (which zeroes the completely unused xmm3 register) and vmovss instructions in _fma compared to std_fma (which relies on the compiler intrinsic std::fma).

The assembly generated by GCC 6.2 with -march=x86-64 -mfma -O3:

std_fma(float, float, float):
        vfmadd132ss     xmm0, xmm2, xmm1
        ret
_fma(float, float, float):
        vinsertps       xmm1, xmm1, xmm1, 0xe
        vinsertps       xmm2, xmm2, xmm2, 0xe
        vinsertps       xmm0, xmm0, xmm0, 0xe
        vfmadd132ss     xmm0, xmm2, xmm1
        ret
_sqrt(float):
        vinsertps       xmm0, xmm0, xmm0, 0xe
        vsqrtss xmm0, xmm0, xmm0
        ret

Here there are a lot of unnecessary vinsertps instructions.

Working example: https://godbolt.org/g/q1BQym

The default x64 calling convention passes floating-point function arguments in XMM registers, so those vmovss and vinsertps instructions should be eliminated. Why do the mentioned compilers still emit them? Is it possible to get rid of them without inline assembly?

I also tried using _mm_cvtss_f32 instead of _mm_store_ss, as well as multiple calling conventions, but nothing changed.
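
For reference, that attempt looked roughly like this (my reconstruction; the function name _fma_cvt is illustrative and not from the original code), and it produced the same extra instructions:

#include <immintrin.h>

float _fma_cvt(float x, float y, float z)
{
    // Extract the lowest element directly instead of storing it back through memory.
    return _mm_cvtss_f32(
        _mm_fmadd_ss(_mm_load_ss(&x), _mm_load_ss(&y), _mm_load_ss(&z))
    );
}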

Upvotes: 4

Views: 1318

Answers (1)

plasmacel

Reputation: 8530

I'm writing this answer based on the comments, some discussion, and my own experience.

As Ross Ridge pointed out in the comments, the compiler is not smart enough to recognize that only the lowest floating-point element of the XMM register is used, so it zeroes out the other three elements with those vxorps / vinsertps instructions. This is absolutely unnecessary, but what can you do?

It should be noted that clang 3.9 does a much better job than GCC 6.2 (or the current snapshot of 7.0) at generating assembly for Intel intrinsics, since it only fails at _mm_fmadd_ss in my example. I tested more intrinsics as well, and in most cases clang did a perfect job, emitting single instructions.

What can you do

You can use the standard <cmath> functions, in the hope that they are defined as compiler intrinsics whenever a proper CPU instruction is available.
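
For example, a minimal sketch along these lines (the wrapper name std_sqrt is just illustrative) leaves the instruction selection entirely to the compiler, and the same applies to std::fma from the question:

#include <cmath>

float std_sqrt(float x)
{
    // With the flags discussed below, this is expected to compile
    // to a single (v)sqrtss instruction.
    return std::sqrt(x);
}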

This is not enough

Compilers like GCC implement these functions with special handling of NaNs and infinities, so in addition to the intrinsic they may emit comparisons, branches, and possibly errno handling.

The compiler flags -fno-math-errno and -fno-trapping-math help GCC and clang eliminate the additional floating-point special cases and errno handling, so they can emit single instructions where possible: https://godbolt.org/g/LZJyaB.
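
For example, a typical invocation (the file name is only a placeholder) would be:

g++ -march=x86-64 -mfma -O3 -fno-math-errno -fno-trapping-math -S example.cpp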

You can achieve the same with -ffast-math, since it includes the above flags, but it also includes much more than that, and the rest (like unsafe math optimizations) is probably not desired.

Unfortunately this is not a portable solution. It works in most cases (see the godbolt link), but you still depend on the implementation.

What more can you do

You can still use inline assembly, which is also not portable, is much trickier, and leaves many more things to consider. Despite that, for such simple one-line instructions it can be okay.

Things to consider:

1st GCC/clang and Visual Studio use different syntax for inline assembly, and Visual Studio doesn't allow it in x64 mode.

2nd You need to emit VEX-encoded instructions (3-operand variants, e.g. vsqrtss xmm0, xmm1, xmm2) for AVX targets, and non-VEX-encoded variants (2-operand variants, e.g. sqrtss xmm0, xmm1) for pre-AVX CPUs. VEX-encoded instructions are 3-operand instructions, so they offer more freedom for the compiler to optimize. To take advantage of that, the register input/output parameters must be set up properly. Something like the following does the job:

#   if __AVX__
    asm("vsqrtss %1, %1, %0" :"=x"(x) : "x"(x));
#   else
    asm("sqrtss %1, %0" :"=x"(x) : "x"(x));
#   endif

But the following is a bad technique for the VEX case:

asm("vsqrtss %1, %1, %0" :"+x"(x));

It can result in an unnecessary move instruction; check https://godbolt.org/g/VtNMLL.
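
Putting the good variant together, a minimal sketch of a wrapper function (the name asm_sqrt is mine, not part of the original code) could look like this:

static inline float asm_sqrt(float x)
{
    // Select the VEX or non-VEX form at compile time; the "x" constraints
    // let the compiler keep the value in any XMM register.
#   if __AVX__
    asm("vsqrtss %1, %1, %0" : "=x"(x) : "x"(x));
#   else
    asm("sqrtss %1, %0" : "=x"(x) : "x"(x));
#   endif
    return x;
}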

3rd As Peter Cordes pointed out, you can lose common subexpression elimination (CSE) and constant folding (constant propagation) with inline-assembly functions. However, if the inline asm is not declared volatile, the compiler can treat it as a pure function that depends only on its inputs and still perform common subexpression elimination, which is great.

As Peter said:

"Don't use inline asm" isn't an absolute rule, it's just something you should be aware of and consider carefully before using. If the alternatives don't meet your requirements, and you don't end up with this inlining into places where it can't optimize, then go right ahead.

Upvotes: 3
