PST

Reputation: 59

Strange gcc 6.1 -O2 compiling behaviour

I am compiling the same benchmark with gcc using the -O2 -march=native flags. Interestingly, when I look at the objdump output, it actually contains instructions like vxorpd, which I thought should only appear when -ftree-vectorize is enabled (and -O2 should not enable that by default?). If I add the -m32 flag to compile for 32-bit, these packed instructions disappear. Has anyone met a similar situation and could offer an explanation? Thanks.

Upvotes: 1

Views: 208

Answers (2)

Rainer Keller

Reputation: 355

Just to add to Cody Gray's very good answer: you can check which options gcc has enabled internally by compiling to assembly with -fverbose-asm turned on.

For example:

gcc -O2 -fverbose-asm -S -o test.S test.c

will list in test.S all optimization options enabled at the chosen optimization level (here -O2).
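
A related trick, assuming a reasonably recent GCC: you can also ask the driver directly which optimization flags a given level enables, without compiling anything:

# Print every optimizer flag and whether it is enabled at -O2
gcc -O2 -Q --help=optimizers

On GCC 6.1 this should show -ftree-vectorize as disabled at -O2, confirming that the VEX-encoded instructions are not coming from the vectorizer.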

Upvotes: 1

Cody Gray

Reputation: 245001

XORPD is the classic SSE2 instruction that performs a bitwise logical XOR on two packed double-precision floating-point values.

VXORPD is the vector version of that same instruction. Essentially, it is the classic SSE2 XORPD instruction with a VEX prefix. That's what the "V" prefix in the mnemonic means. It was introduced with AVX (Advanced Vector Extensions) and is supported on any architecture that supports AVX. (There are actually two versions: the VEX.128-encoded version that works on the 128-bit XMM registers, and the VEX.256-encoded version that works on the 256-bit YMM registers, both available with AVX.)

All of the legacy SSE and SSE2 instructions can have a VEX prefix added to them, giving them a three-operand form and allowing them to interact and schedule more efficiently with the other new AVX instructions. It also avoids the high cost of transitions between VEX and non-VEX modes. Otherwise, these new encodings retain identical behavior. As such, compilers will typically generate VEX-prefixed versions of these instructions whenever the target architecture supports them. Clearly, in your case, -march=native is specifying an architecture that supports, at a minimum, AVX.
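
To make the distinction concrete, here is a sketch of the encodings side by side (Intel syntax; the register choices are illustrative):

xorpd  xmm0, xmm1          ; legacy SSE2: two operands, destructive (xmm0 = xmm0 XOR xmm1)
vxorpd xmm0, xmm1, xmm2    ; VEX.128: three operands, non-destructive (xmm0 = xmm1 XOR xmm2)
vxorpd ymm0, ymm1, ymm2    ; VEX.256: the same operation across the full 256-bit YMM registers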

On GCC and Clang, you will actually get these instructions emitted even with optimization turned off (-O0), so you will certainly get them when optimizations are enabled. Neither the -ftree-vectorize switch nor any of the other vectorization-specific optimization switches needs to be on, because this doesn't actually have anything to do with vectorizing your code. More precisely, the code flow hasn't changed, just the encoding of the instructions.

You can see this with the simplest code imaginable:

double Foo()
{
   return 0.0;
}

With -march=native on an AVX-capable machine, this compiles to:

Foo():
        vxorpd  xmm0, xmm0, xmm0
        ret
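
For reference, a command along these lines reproduces the listing above (foo.c is a hypothetical file name for the function; exact output varies by GCC version and CPU):

gcc -O2 -march=native -masm=intel -S -o - foo.c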

So that explains why you're seeing VXORPD and its friends when you compile a 64-bit build with the -march=native switch.

That leaves the question of why you don't see it when you throw the -m32 switch (which tells the compiler to generate code for 32-bit targets). SSE and AVX instructions are still available when targeting these platforms, and I believe they will be used under certain circumstances, but they cannot be used quite as frequently because of significant differences in the 32-bit ABI. Specifically, the 32-bit ABI requires that floating-point values be returned on the x87 floating-point stack. Since that requires the use of the x87 floating-point instructions, the optimizer tends to stick with those unless it is heavily vectorizing a section of code. That's the only time it really makes sense to shuffle values from the x87 stack to SIMD registers and back again. Otherwise, that's a performance drain for little to no practical benefit.

You can see this too in action. Look at what changes in the output just by throwing the -m32 switch:

Foo():
        fldz
        ret

FLDZ is the x87 FPU instruction for pushing the constant zero onto the top of the floating-point stack, where it is ready to be returned to the caller.
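
You can probe the ABI constraint yourself: even if you force SSE arithmetic in 32-bit mode with -mfpmath=sse (assuming the target supports SSE2), the return value still has to travel through the x87 stack, so you will typically still see fldz for the function above:

gcc -m32 -march=native -mfpmath=sse -masm=intel -S -o - foo.c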

Obviously, as you make the code more complicated, you are more likely to change the optimizer's heuristics and persuade it to emit SIMD instructions. You are far more likely still to see them if you enable vectorization-based optimizations.
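
For instance, a loop like this hypothetical one is a candidate for true vectorization once -O3 or -O2 -ftree-vectorize is in effect; on an AVX target the vectorizer may then emit genuinely packed arithmetic such as vaddpd on the ymm registers:

void Add(double *a, const double *b, int n)
{
    for (int i = 0; i < n; ++i)
        a[i] += b[i];    /* independent element-wise adds: vectorizable */
}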

Upvotes: 3
