Reputation: 59
I am compiling the same benchmark with the gcc -O2 -march=native flags. The interesting thing is that when I look at the objdump, it actually produces some instructions like vxorpd, etc., which I thought should only appear when -ftree-vectorize is enabled (and -O2 should not enable this by default?). If I add the -m32 flag to compile for 32-bit, these packed instructions disappear. Has anyone met a similar situation and could give an explanation? Thanks.
Upvotes: 1
Views: 208
Reputation: 355
Just to add to Cody Gray's very good answer, you may check gcc's internally enabled options by outputting to assembler and turning on -fverbose-asm.
For example:
gcc -O2 -fverbose-asm -S -o test.S test.c
will list in test.S all optimization options enabled at the chosen optimization level (here -O2).
Upvotes: 1
Reputation: 245001
XORPD is the classic SSE2 instruction that performs a bitwise logical XOR on two packed double-precision floating-point values.
VXORPD is the vector version of that same instruction. Essentially, it is the classic SSE2 XORPD instruction with a VEX prefix. That's what the "V" prefix means in the opcode. It was introduced with AVX (Advanced Vector Extensions), and is supported on any architecture that supports AVX. (There are actually two versions: the VEX.128-encoded version that works on the 128-bit XMM registers, and the VEX.256-encoded version that works on the 256-bit YMM registers introduced with AVX.)
All of the legacy SSE and SSE2 instructions can have a VEX prefix added to them, giving them a three-operand form and allowing them to interact and schedule more efficiently with the other new AVX instructions. It also avoids the high cost of transitions between VEX and non-VEX modes. Otherwise, these new encodings retain identical behavior. As such, compilers will typically generate VEX-prefixed versions of these instructions whenever the target architecture supports them. Clearly, in your case, -march=native is specifying an architecture that supports, at a minimum, AVX.
On GCC and Clang, you will actually get these instructions emitted even with optimization turned off (-O0), so you will certainly get them when optimizations are enabled. Neither the -ftree-vectorize switch nor any of the other vectorization-specific optimization switches needs to be on, because this doesn't actually have anything to do with vectorizing your code. More precisely, the code flow hasn't changed; only the encoding of the instructions has.
You can see this with the simplest code imaginable:
double Foo()
{
return 0.0;
}
Foo():
vxorpd xmm0, xmm0, xmm0
ret
So that explains why you're seeing VXORPD and its friends when you compile a 64-bit build with the -march=native switch.
That leaves the question of why you don't see it when you throw the -m32 switch (which means to generate code for 32-bit platforms). SSE and AVX instructions are still available when targeting these platforms, and I believe they will be used under certain circumstances, but they cannot be used quite as frequently because of significant differences in the 32-bit ABI. Specifically, the 32-bit ABI requires that floating-point values be returned on the x87 floating-point stack. Since that requires the use of x87 floating-point instructions, the optimizer tends to stick with those unless it is heavily vectorizing a section of code. That's the only time it really makes sense to shuffle values from the x87 stack to SIMD registers and back again. Otherwise, that's a performance drain for little to no practical benefit.
You can see this in action, too. Look at what changes in the output just by throwing the -m32 switch:
Foo():
fldz
ret
FLDZ is the x87 FPU instruction for loading the constant zero at the top of the floating-point stack, where it is ready to be returned to the caller.
Obviously, as you make the code more complicated, you are more likely to change the optimizer's heuristics and persuade it to emit SIMD instructions. You are far more likely still to see them if you enable vectorization-based optimizations.
Upvotes: 3