Reputation: 631
Hi, I'm trying to build without any AVX512 instructions by using these flags: -march=native -mno-avx512f.
However, I still get a binary that contains an AVX512 instruction (vmovss), according to elfx86exts, which I'm using to check.
Any idea how to disable those?
Upvotes: 2
Views: 2977
Reputation: 631
I found the error in my use case: one of the compilation units depended on the OpenVINO SDK, which explicitly added the -mavx512f flag.
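For anyone hunting a similar stray flag, here is a minimal sketch of one way to find it. It assumes the build system can export a compile_commands.json (the original post does not say which build system was used), and the function name units_enabling_avx512f is made up for illustration. It relies on GCC's usual behaviour that the last of -mavx512f / -mno-avx512f on the command line wins. It does not account for -march= values or other -mavx512* flags that imply AVX512F.

```python
import shlex

def units_enabling_avx512f(entries):
    """Return the source files whose command line ends up with -mavx512f on.

    `entries` is a list of dicts in the compile_commands.json shape: each has
    "file" and either "command" (a string) or "arguments" (a list).
    Only the literal flags -mavx512f and -mno-avx512f are considered here.
    """
    flagged = []
    for entry in entries:
        args = entry.get("arguments") or shlex.split(entry.get("command", ""))
        enabled = False
        for arg in args:
            if arg == "-mavx512f":
                enabled = True       # a later flag overrides an earlier one
            elif arg == "-mno-avx512f":
                enabled = False
        if enabled:
            flagged.append(entry["file"])
    return flagged
```

Typical use would be to load the file with json.load(open("compile_commands.json")) and pass the resulting list to the function.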
Upvotes: 2
Reputation: 365677
-march=native -mno-avx512f is the correct option; vmovss only requires AVX1.
There is an AVX512F EVEX encoding of vmovss, but GAS won't use it unless the register involved is xmm16..31. GCC won't emit asm using those registers when you disable AVX512F with -mno-avx512f, or don't enable it in the first place with something like -march=skylake or -march=znver2.
If you're still not sure, check the actual disassembly + machine code to see what prefix the instruction starts with:
C5 or C4 byte: start of a 2 or 3 byte VEX prefix, AVX1 encoding.
62 byte: start of an EVEX prefix, AVX512F encoding.
For example:
.intel_syntax noprefix
vmovss xmm15, [rdi]
vmovss xmm15, [r11]
vmovss xmm16, [rdi]
assembled with gcc -c avx.s and disassembled with objdump -drwC -Mintel avx.o:
0000000000000000 <.text>:
0: c5 7a 10 3f vmovss xmm15,DWORD PTR [rdi] # AVX1
4: c4 41 7a 10 3b vmovss xmm15,DWORD PTR [r11] # AVX1
9: 62 e1 7e 08 10 07 vmovss xmm16,DWORD PTR [rdi] # AVX512F
These show 2 and 3 byte VEX prefixes, and a 4 byte EVEX prefix, before the 10 opcode. (The ModRM bytes are different too; xmm0 and xmm16 would differ only in the extra register bit from the prefix, not the ModRM.)
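The prefix check can be automated over machine-code bytes. The following is a rough sketch, not a real decoder: the name classify_encoding is invented for this example, it assumes 64-bit code (where C4, C5 and 62 are not valid as other opcodes), and it only skips segment-override and address-size prefixes in front of the VEX/EVEX prefix.

```python
# Legacy prefixes that may appear before a VEX/EVEX prefix:
# segment overrides and the address-size override.
SKIPPABLE_PREFIXES = {0x26, 0x2E, 0x36, 0x3E, 0x64, 0x65, 0x67}

def classify_encoding(machine_code: bytes) -> str:
    """Classify the encoding of one instruction given its raw bytes.

    Returns "VEX2", "VEX3", "EVEX", or "other". Assumes 64-bit mode.
    """
    for byte in machine_code:
        if byte in SKIPPABLE_PREFIXES:
            continue
        if byte == 0xC5:
            return "VEX2"   # 2-byte VEX prefix (AVX1/AVX2 encoding)
        if byte == 0xC4:
            return "VEX3"   # 3-byte VEX prefix (AVX1/AVX2 encoding)
        if byte == 0x62:
            return "EVEX"   # 4-byte EVEX prefix (AVX512 encoding)
        return "other"
    return "other"
```

Feeding it the three encodings from the listing above, e.g. classify_encoding(bytes.fromhex("62e17e081007")), distinguishes the xmm16 case from the two xmm15 ones.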
GAS uses the AVX1 VEX encoding of vmovss and other instructions when possible. So you can count on instructions that have a non-AVX512F form to be using the non-AVX512F form whenever possible. This is how the GNU toolchain (used by GCC) makes -mno-avx512f work.
This applies even when the EVEX encoding would be shorter, e.g. when a [reg + constant] addressing mode could use an AVX512 scaled disp8 (scaled by the size of the memory operand) but the AVX1 encoding would need a 32-bit displacement that counts in bytes.
f: c5 7a 10 bf 00 01 00 00 vmovss xmm15,DWORD PTR [rdi+0x100] # AVX1 [reg+disp32]
17: 62 e1 7e 08 10 47 40 vmovss xmm16,DWORD PTR [rdi+0x100] # AVX512 [reg + disp8*4]
1e: c5 78 28 bf 00 01 00 00 vmovaps xmm15,XMMWORD PTR [rdi+0x100] # AVX1 [reg+disp32]
26: 62 e1 7c 08 28 47 10 vmovaps xmm16,XMMWORD PTR [rdi+0x100] # AVX512 [reg + disp8*16]
Note the last byte, or last 4 bytes, of the machine code encodings: it's a 32-bit little-endian 0x100 byte displacement for the AVX1 encodings, but an 8-bit displacement of 0x40 dwords or 0x10 dqwords for the AVX512 encodings.
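The displacement bytes in those listings can be reproduced with a few lines of arithmetic. This is a simplified sketch with a hypothetical helper name (encode_displacement). It models only the choice between an 8-bit and a 32-bit displacement for a nonzero [reg + disp] operand: scale=1 stands for the VEX/legacy byte-granular disp8, and scale=4 or 16 stands for the EVEX disp8 scaled by the operand size.

```python
import struct

def encode_displacement(disp: int, scale: int = 1) -> bytes:
    """Return the little-endian displacement bytes for [reg + disp].

    scale=1 models VEX/legacy encodings (disp8 counts bytes).
    scale=N models EVEX compressed disp8, which counts units of N bytes.
    Falls back to a 32-bit byte displacement when disp8 cannot represent it.
    """
    if disp % scale == 0 and -128 <= disp // scale <= 127:
        return struct.pack("<b", disp // scale)   # signed 8-bit
    return struct.pack("<i", disp)                # signed 32-bit, in bytes
```

For 0x100 this gives 00 01 00 00 with scale=1, 40 with scale=4 (vmovss), and 10 with scale=16 (vmovaps on an xmm register), matching the trailing bytes shown above.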
But using an asm-source override of {evex} vmovaps xmm0, [rdi+256] we can get the compact encoding even for "low" registers:
62 f1 7c 08 28 47 10 vmovaps xmm0,XMMWORD PTR [rdi+0x100]
GCC will of course not do that with -mno-avx512f.
Unfortunately GCC and clang also miss that optimization when you do enable AVX512F, e.g. when compiling __m128 load(__m128 *p){ return p[16]; } with -O3 -march=skylake-avx512 (Godbolt). To see this, use binary mode, or simply note the lack of an {evex} tag on that asm source line of compiler output.
Upvotes: 8