I'm using Intel SSE/AVX/FMA intrinsics to get perfectly inlined SSE/AVX instructions for some math functions.
Given the following code:
#include <cmath>
#include <immintrin.h>

auto std_fma(float x, float y, float z)
{
    return std::fma(x, y, z);
}

float _fma(float x, float y, float z)
{
    _mm_store_ss(&x,
        _mm_fmadd_ss(_mm_load_ss(&x), _mm_load_ss(&y), _mm_load_ss(&z))
    );
    return x;
}

float _sqrt(float x)
{
    _mm_store_ss(&x,
        _mm_sqrt_ss(_mm_load_ss(&x))
    );
    return x;
}
the assembly generated by clang 3.9 with -march=x86-64 -mfma -O3 is:
std_fma(float, float, float):               # @std_fma(float, float, float)
    vfmadd213ss xmm0, xmm1, xmm2
    ret

_fma(float, float, float):                  # @_fma(float, float, float)
    vxorps      xmm3, xmm3, xmm3
    vmovss      xmm0, xmm3, xmm0            # xmm0 = xmm0[0],xmm3[1,2,3]
    vmovss      xmm1, xmm3, xmm1            # xmm1 = xmm1[0],xmm3[1,2,3]
    vmovss      xmm2, xmm3, xmm2            # xmm2 = xmm2[0],xmm3[1,2,3]
    vfmadd213ss xmm0, xmm1, xmm2
    ret

_sqrt(float):                               # @_sqrt(float)
    vsqrtss     xmm0, xmm0, xmm0
    ret
While the generated code for _sqrt is fine, there are unnecessary vxorps (which zeroes the otherwise unused xmm3 register) and vmovss instructions in _fma compared to std_fma (which relies on the compiler intrinsic std::fma).
The assembly generated by GCC 6.2 with -march=x86-64 -mfma -O3 is:
std_fma(float, float, float):
    vfmadd132ss xmm0, xmm2, xmm1
    ret

_fma(float, float, float):
    vinsertps   xmm1, xmm1, xmm1, 0xe
    vinsertps   xmm2, xmm2, xmm2, 0xe
    vinsertps   xmm0, xmm0, xmm0, 0xe
    vfmadd132ss xmm0, xmm2, xmm1
    ret

_sqrt(float):
    vinsertps   xmm0, xmm0, xmm0, 0xe
    vsqrtss     xmm0, xmm0, xmm0
    ret
Here there are a lot of unnecessary vinsertps instructions, this time in _sqrt as well.
Working example: https://godbolt.org/g/q1BQym
The default x64 calling convention passes floating-point function arguments in XMM registers, so those vmovss and vinsertps instructions should be eliminated. Why do the mentioned compilers still emit them? Is it possible to get rid of them without inline assembly?
I also tried using _mm_cvtss_f32 instead of _mm_store_ss, as well as multiple calling conventions, but nothing changed.
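For reference, this is roughly the _mm_cvtss_f32 variant I tried (the function name here is just for illustration); it produces the same unnecessary register-zeroing instructions:

float _fma_cvt(float x, float y, float z)
{
    // Same as _fma above, but reads the result with _mm_cvtss_f32
    // instead of storing it back through _mm_store_ss.
    return _mm_cvtss_f32(
        _mm_fmadd_ss(_mm_load_ss(&x), _mm_load_ss(&y), _mm_load_ss(&z))
    );
}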
I'm writing this answer based on the comments, some discussion, and my own experiments.
As Ross Ridge pointed out in the comments, the compiler is not smart enough to recognize that only the lowest floating-point element of the XMM register is used, so it zeroes out the other three elements with those vxorps / vinsertps instructions. This is absolutely unnecessary, but what can you do?
Note that clang 3.9 does a much better job than GCC 6.2 (or the current snapshot of 7.0) at generating assembly for Intel intrinsics, since it only fails at _mm_fmadd_ss in my example. I tested more intrinsics as well, and in most cases clang did a perfect job of emitting single instructions.
What can you do
You can use the standard <cmath> functions, in the hope that they are implemented as compiler intrinsics whenever a proper CPU instruction is available.
This is not enough
Compilers like GCC implement these functions with special handling of NaNs and infinities, so in addition to the intrinsic they may emit comparisons, branching, and possibly errno handling.
The compiler flags -fno-math-errno and -fno-trapping-math do help GCC and clang eliminate the additional floating-point special cases and errno handling, so they can emit single instructions where possible: https://godbolt.org/g/LZJyaB.
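As an illustration, a minimal sketch of this approach (the wrapper names are my own); compiled with -mfma -fno-math-errno -fno-trapping-math, both GCC and clang should reduce each function to a single instruction:

#include <cmath>

// Plain <cmath> wrappers; with the flags above these should compile
// down to a bare vsqrtss / vfmadd...ss without branching or errno code.
float lib_sqrt(float x)                  { return std::sqrt(x); }
float lib_fma(float x, float y, float z) { return std::fma(x, y, z); }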
You can achieve the same with -ffast-math, since it implies the above flags, but it enables much more than that (such as unsafe math optimizations), which is probably not desired.
Unfortunately this is not a portable solution. It works in most cases (see the godbolt link), but you still depend on the implementation.
What more can you do
You can still use inline assembly, which is also not portable, is much more tricky, and leaves many more things to consider. In spite of that, for such simple one-line instructions it can be okay.
Things to consider:
1st GCC/clang and Visual Studio use different syntax for inline assembly, and Visual Studio doesn't allow it in x64 mode.
2nd You need to emit VEX-encoded instructions (3-operand variants, e.g. vsqrtss xmm0, xmm1, xmm2) for AVX targets, and non-VEX-encoded variants (2-operand variants, e.g. sqrtss xmm0, xmm1) for pre-AVX CPUs. VEX-encoded instructions are 3-operand instructions, so they offer more freedom for the compiler to optimize. To take advantage of that, the register input/output parameters must be set properly. So something like the following does the job.
#if __AVX__
    asm("vsqrtss %1, %1, %0" : "=x"(x) : "x"(x));
#else
    asm("sqrtss %1, %0" : "=x"(x) : "x"(x));
#endif
But the following is a bad technique for VEX:

asm("vsqrtss %1, %1, %0" : "+x"(x));

It can yield an unnecessary move instruction; check https://godbolt.org/g/VtNMLL.
3rd As Peter Cordes pointed out, you can lose common subexpression elimination (CSE) and constant folding (constant propagation) for inline assembly functions. However, if the inline asm is not declared volatile, the compiler can treat it as a pure function that depends only on its inputs and perform common subexpression elimination, which is great.
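Putting the pieces together, here is a sketch of a complete wrapper (the function name is my own invention):

static inline float asm_sqrt(float x)
{
    // Not declared volatile, so the compiler may treat it as a pure
    // function of its input and CSE repeated calls.
#if __AVX__
    asm("vsqrtss %1, %1, %0" : "=x"(x) : "x"(x));  // VEX-encoded for AVX targets
#else
    asm("sqrtss %1, %0" : "=x"(x) : "x"(x));       // legacy SSE encoding otherwise
#endif
    return x;
}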
As Peter said:
"Don't use inline asm" isn't an absolute rule, it's just something you should be aware of and consider carefully before using. If the alternatives don't meet your requirements, and you don't end up with this inlining into places where it can't optimize, then go right ahead.