Reputation: 21223
Consider this simple loop:
float f(float x[]) {
    float p = 1.0;
    for (int i = 0; i < 128; i++)
        p += x[i];
    return p;
}
If you compile it with -O2 -march=haswell in gcc you get:
f:
vmovss xmm0, DWORD PTR .LC0[rip]
lea rax, [rdi+512]
.L2:
vaddss xmm0, xmm0, DWORD PTR [rdi]
add rdi, 4
cmp rdi, rax
jne .L2
ret
.LC0:
.long 1065353216
However, the Intel C Compiler gives:
f:
xor eax, eax #3.3
pxor xmm0, xmm0 #2.11
movaps xmm7, xmm0 #2.11
movaps xmm6, xmm0 #2.11
movaps xmm5, xmm0 #2.11
movaps xmm4, xmm0 #2.11
movaps xmm3, xmm0 #2.11
movaps xmm2, xmm0 #2.11
movaps xmm1, xmm0 #2.11
..B1.2: # Preds ..B1.2 ..B1.1
movups xmm8, XMMWORD PTR [rdi+rax*4] #4.10
movups xmm9, XMMWORD PTR [16+rdi+rax*4] #4.10
movups xmm10, XMMWORD PTR [32+rdi+rax*4] #4.10
movups xmm11, XMMWORD PTR [48+rdi+rax*4] #4.10
movups xmm12, XMMWORD PTR [64+rdi+rax*4] #4.10
movups xmm13, XMMWORD PTR [80+rdi+rax*4] #4.10
movups xmm14, XMMWORD PTR [96+rdi+rax*4] #4.10
movups xmm15, XMMWORD PTR [112+rdi+rax*4] #4.10
addps xmm0, xmm8 #4.5
addps xmm7, xmm9 #4.5
addps xmm6, xmm10 #4.5
addps xmm5, xmm11 #4.5
addps xmm4, xmm12 #4.5
addps xmm3, xmm13 #4.5
addps xmm2, xmm14 #4.5
addps xmm1, xmm15 #4.5
add rax, 32 #3.3
cmp rax, 128 #3.3
jb ..B1.2 # Prob 99% #3.3
addps xmm0, xmm7 #2.11
addps xmm6, xmm5 #2.11
addps xmm4, xmm3 #2.11
addps xmm2, xmm1 #2.11
addps xmm0, xmm6 #2.11
addps xmm4, xmm2 #2.11
addps xmm0, xmm4 #2.11
movaps xmm1, xmm0 #2.11
movhlps xmm1, xmm0 #2.11
addps xmm0, xmm1 #2.11
movaps xmm2, xmm0 #2.11
shufps xmm2, xmm0, 245 #2.11
addss xmm0, xmm2 #2.11
addss xmm0, DWORD PTR .L_2il0floatpacket.0[rip] #2.11
ret #5.10
.L_2il0floatpacket.0:
.long 0x3f800000
If we ignore the loop unrolling, the most obvious difference is that gcc uses vaddss and icc uses addss.
Is there a performance difference between these two pieces of assembly and which one is better (ignoring the loop unrolling)?
The v prefix comes from the VEX coding scheme. It seems you can get icc to use these instructions by adding -xavx to the command line flags. However, the question remains whether there is any performance difference between the two sets of assembly in the question, or whether one has any advantage over the other.
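One way to check which encoding a given set of flags should produce, without reading the assembly, is to test the predefined __AVX__ macro. This is only a sketch: gcc, clang and icc define __AVX__ when AVX code generation is enabled (for example with -mavx, -march=haswell or -xavx), and in that mode they normally emit the VEX forms for scalar SSE arithmetic as well. The function name expected_scalar_add is made up for illustration.

```c
/* Reports which scalar-add encoding the compiler is expected to emit on
   x86-64, judged only by the predefined __AVX__ macro. The function name
   is hypothetical; the macro itself is set by the compiler's flags. */
const char *expected_scalar_add(void) {
#ifdef __AVX__
    return "vaddss (VEX)";        /* AVX enabled at compile time */
#else
    return "addss (legacy SSE)";  /* AVX not enabled at compile time */
#endif
}
```

Printing the result from a build with and without -march=haswell should show the two cases. The macro only reflects the flags, so the generated assembly is still the final word.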
Upvotes: 1
Views: 341
Reputation: 93014
The instructions with mnemonics prefixed v are VEX encoded instructions. The VEX encoding scheme allows for the encoding of every SSE instruction as well as the new AVX instructions and some other instructions. There is an almost 1:1 correspondence between legacy instructions and VEX encoded instructions with the following differences:
VEX encoded instructions zero the upper bits of the ymm register corresponding to the xmm register operand used in the instruction. This avoids a costly partial register update if a previous instruction left data in these bits.
Upvotes: 5
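To make the upper-bits difference concrete, here is a toy C model of what addss and vaddss write to the destination register. It is a sketch of the architectural semantics only, not of timing: a ymm register is modelled as eight float lanes, and the names ymm_model, model_addss and model_vaddss are invented for this example.

```c
#include <string.h>

/* Toy model of one 256-bit ymm register as eight float lanes.
   Lanes 0-3 are the xmm part, lanes 4-7 are the upper 128 bits. */
typedef struct { float lane[8]; } ymm_model;

/* Legacy SSE: addss xmm_dst, src
   Only lane 0 is written; lanes 1-7 of dst keep their old contents,
   so the result depends on whatever was left in the upper bits. */
void model_addss(ymm_model *dst, float src) {
    dst->lane[0] += src;
}

/* VEX: vaddss xmm_dst, xmm_src1, src2
   Lane 0 = src1 lane 0 + src2, lanes 1-3 are copied from src1,
   and lanes 4-7 of dst are zeroed. */
void model_vaddss(ymm_model *dst, const ymm_model *src1, float src2) {
    ymm_model r;
    memset(&r, 0, sizeof r);              /* upper lanes become 0.0f */
    r.lane[0] = src1->lane[0] + src2;
    for (int i = 1; i < 4; i++)
        r.lane[i] = src1->lane[i];
    *dst = r;                             /* safe even if dst == src1 */
}
```

In the model, model_addss leaves stale data in lanes 4-7 untouched, while model_vaddss overwrites all eight lanes, so the destination no longer depends on its previous upper half.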