Reputation: 294
I had some C code written with Intel intrinsics. After compiling it first with AVX and then with SSSE3 flags, I got two quite different assembly outputs. E.g.:
AVX:
vpunpckhbw %xmm0, %xmm1, %xmm2
SSSE3:
movdqa %xmm0, %xmm2
punpckhbw %xmm1, %xmm2
It's clear that vpunpckhbw is just punpckhbw using the AVX three-operand syntax. But are the latency and throughput of the first instruction equivalent to the combined latency and throughput of the last two?
Or does the answer depend on the architecture I'm using? It's an Intel Core i5-6500, by the way.
I tried to search for an answer in Agner Fog's instruction tables but couldn't find the answer. Intel specifications also didn't help (however, it's likely that I just missed the one I needed).
Is it always better to use new AVX syntax if possible?
Upvotes: 4
Views: 666
Reputation: 33679
Is it always better to use new AVX syntax if possible?
I think the first question to ask is whether a folded instruction is better than an unfolded instruction pair. Folding takes a pair of instructions, a register copy followed by a modify, like this
vmovdqa %xmm1, %xmm2
vpunpckhbw %xmm0, %xmm2, %xmm2
and "folds" them into one combined instruction
vpunpckhbw %xmm0, %xmm1, %xmm2
Since Ivy Bridge, a register-to-register move can have zero latency and use no execution port (move elimination). However, the unfolded pair still counts as two instructions in the front end and can therefore affect overall throughput. The folded instruction counts as only one instruction in the front end, which lowers front-end pressure without any side effects. This could increase overall throughput.
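To see the register-to-register case from C, here is a minimal sketch. The function names `interleave_and_add` and `interleave_and_add_bytes` are hypothetical, made up for illustration. Because `a` is used again after the unpack, a compiler targeting SSE will typically have to copy a register before the destructive two-operand `punpckhbw`, whereas with AVX it can write the result straight to a third register. The exact output depends on your compiler and register allocation, so treat this as something to inspect rather than a guarantee.

```c
#include <emmintrin.h>  /* SSE2: __m128i, _mm_unpackhi_epi8, _mm_add_epi8 */
#include <stdint.h>

/* Hypothetical example: `a` stays live across the unpack, so the
 * two-operand SSE form (which overwrites one of its inputs) usually
 * needs an extra movdqa, while the three-operand VEX form does not. */
__m128i interleave_and_add(__m128i a, __m128i b)
{
    __m128i hi = _mm_unpackhi_epi8(a, b); /* interleave high 8 bytes of a and b */
    return _mm_add_epi8(hi, a);           /* reuse a, forcing it to survive */
}

/* Hypothetical byte-array wrapper so the result is easy to check. */
void interleave_and_add_bytes(const uint8_t a[16], const uint8_t b[16],
                              uint8_t out[16])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, interleave_and_add(va, vb));
}
```

Compiling this once with `gcc -O2 -S -mssse3` and once with `gcc -O2 -S -mavx` and comparing the assembly should show whether the copy appears in your build.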
However, for memory-to-register moves, folding may have a side effect (there is currently some debate about this) even though it lowers pressure on the front end. The reason is that, from the front end's point of view, the out-of-order engine only sees one folded instruction (assuming this answer is correct). The memory read does require an execution port and has latency, so if it would be better to reorder it independently of the other operation in the folded instruction, the out-of-order engine won't be able to take advantage of this. I observed this for the first time here.
For your particular operation the AVX syntax is always better, since it folds the register-to-register move. However, if you had a memory-to-register move, the folded AVX instruction could perform worse than the unfolded SSE instruction pair in some cases.
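The memory-to-register case can be sketched the same way. The names `interleave_with_memory` and `interleave_with_memory_bytes` are again hypothetical. The source writes the load and the unpack as two separate intrinsics, but the compiler is free to merge them into a single instruction with a memory operand (something like `vpunpckhbw (%rdi), %xmm0, %xmm0`) or to keep a separate load. Which one you get is the compiler's choice, not something the intrinsics express.

```c
#include <emmintrin.h>  /* SSE2: __m128i, _mm_load_si128, _mm_unpackhi_epi8 */
#include <stdint.h>

/* Hypothetical example: the second operand comes from memory.
 * The load is written separately here, but the compiler may fold it
 * into the unpack as a memory source operand. */
__m128i interleave_with_memory(__m128i a, const __m128i *p)
{
    __m128i b = _mm_load_si128(p);   /* aligned 16-byte load */
    return _mm_unpackhi_epi8(a, b);  /* interleave high 8 bytes of a and b */
}

/* Hypothetical byte-array wrapper so the result is easy to check. */
void interleave_with_memory_bytes(const uint8_t a[16], const uint8_t b[16],
                                  uint8_t out[16])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b); /* aligned local copy */
    _mm_storeu_si128((__m128i *)out, interleave_with_memory(va, &vb));
}
```

Looking at the generated assembly for `interleave_with_memory` on its own (for example with `gcc -O2 -S -mavx`) shows whether your compiler folded the load.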
Note that, in general, it should still be better to use VEX-encoded instructions. But I think most compilers, if not all, now assume folding is always better, so you have no way to control the folding except by writing assembly (not even with intrinsics) or, in some cases, by telling the compiler not to compile with AVX.
Upvotes: 5