Reputation: 59
I have compiled a piece of code with the option -xSSE4.1 using the Intel compiler. When I looked at the generated assembly file, I see that AVX instructions such as 'vpmovzxbw' have been inserted. But, the executable still seems to run on machines that don't support the AVX instruction set. What explains this?
Here's the particular code snippet -
C -> src0_8x16b = _mm_cvtepu8_epi16 (src0_8x16b);
Assembly -> vpmovzxbw xmm4, QWORD PTR [rcx]
Binary -> 00066 c4 62 79 30 29
Here's another snippet where the assembly instruction uses 3 operands -
C -> src0_8x16b = _mm_sub_epi16 (src0_8x16b, src1_8x16b);
Assembly -> vpsubw xmm1, xmm13, xmm11
Binary -> 000bc c4 c1 11 f9 cb
For comparison, here's the disassembly generated by icc for the function 'foo' (The only difference between the function foo and the code snippet above is that the code snippet was coded using intrinsics) -
Compiler commands used -
icc -S -xSSE4.1 -axavx -O3 foo.c
Function foo -
void foo(float *x, int n)
{
int i;
for(i=0; i<n; i++) x[i] *= 2.0;
}
Autodispatch code -
testl $-131072, __intel_cpu_indicator(%rip) #1.27
jne foo.R #1.27
testl $-1, __intel_cpu_indicator(%rip) #1.27
jne foo.A
Loop in foo.R (AVX variant) -
vmulps (%rdi,%rcx,4), %ymm0, %ymm1 #3.24
vmulps 32(%rdi,%rcx,4), %ymm0, %ymm2 #3.24
vmovups %ymm1, (%rdi,%rcx,4) #3.24
vmovups %ymm2, 32(%rdi,%rcx,4) #3.24
addq $16, %rcx #3.5
cmpq %rdx, %rcx #3.5
jb ..B2.12 # Prob 82% #3.5
Loop in foo.A (SSE variant) -
movaps (%rdi,%r8,4), %xmm1 #3.24
movaps 16(%rdi,%r8,4), %xmm2 #3.24
mulps %xmm0, %xmm1 #3.24
mulps %xmm0, %xmm2 #3.24
movaps %xmm1, (%rdi,%r8,4) #3.24
movaps %xmm2, 16(%rdi,%r8,4) #3.24
addq $8, %r8 #3.5
cmpq %rsi, %r8 #3.5
jb ..B3.12 # Prob 82% #3.5
Upvotes: 1
Views: 521
Reputation: 59
I have tried to replicate the results on two other compilers, viz., gcc and Microsoft Visual Studio's v100 compilers. I was unable to do so, i.e., gcc and v100 compilers seem to be generating the correct disassemblies. As a further step, I looked closely at the differences, if any, that existed between the compiler arguments that I had specified in each case. It turns out that whilst using the icc compiler, I had enabled the option to inherit project defaults for compiling this particular file. The project settings were configured such that this option was included -
-xavx
As a result when this file was being compiled, the settings I had provided -
-xSSE4.1 -axavx
were overridden by the former. This was the cause of the behavior I have detailed in my question.
I am sorry for this error, but I shall not delete this question since @Zboson 's answer is exceptional.
PS - I had mentioned in one of my comments that I was able to run this code on an SSE42 machine. That was because the exe I had run on that machine was indeed SSE41 compliant since I had apparently used an exe generated using the gcc compiler. I ran the icc generated exe and it was indeed crashing with an illegal instruction error on the SSE42 machine.
Upvotes: 3
Reputation: 33679
The Intel compiler can
generate a single executable with multiple levels of vectorization with the -ax flag,
For example to generate code which is compatible with AVX, SSE4.1 and SSE2 to use -axAVX -axSSE4.2 -xSSE2
.
Since you compiled with -axAVX -xSSE4.1
Intel generated a AVX branch and a SSE4.1 branch and at runtime it determines which instruct set is available and chooses that.
Agner Fog has a good description of Intel's CPU dispatcher in his Optimizing C++ manaul. See section "13.7 CPU dispatching in Intel compiler". Intel's CPU dispatcher is not ideal for several reasons, one of which is that it plays bad on AMD, which Agner describes in detail. Personally I would make my own dispatcher.
I compiled the following code with ICC 13.0 with options -O3 -axavx -xsse2
void foo(float *x, int n) {
for(int i=0; i<n; i++) x[i] *= 2.0;
}
and the start of the assembly is
test DWORD PTR __intel_cpu_indicator[rip], -131072 #1.27
jne _Z3fooPfi.R #1.27
test DWORD PTR __intel_cpu_indicator[rip], -1 #1.27
jne _Z3fooPfi.A
going to the _Z3fooPfi.R
branch find the main AVX loop
..B2.12: # Preds ..B2.12 ..B2.11
vmulps ymm1, ymm0, YMMWORD PTR [rdi+rcx*4] #2.25
vmulps ymm2, ymm0, YMMWORD PTR [32+rdi+rcx*4] #2.25
vmovups YMMWORD PTR [rdi+rcx*4], ymm1 #2.25
vmovups YMMWORD PTR [32+rdi+rcx*4], ymm2 #2.25
add rcx, 16 #2.2
cmp rcx, rdx #2.2
jb ..B2.12 # Prob 82% #2.2
going to the _Z3fooPfi.A
branch has the main SSE loop
movaps xmm1, XMMWORD PTR [rdi+r8*4] #2.25
movaps xmm2, XMMWORD PTR [16+rdi+r8*4] #2.25
mulps xmm1, xmm0 #2.25
mulps xmm2, xmm0 #2.25
movaps XMMWORD PTR [rdi+r8*4], xmm1 #2.25
movaps XMMWORD PTR [16+rdi+r8*4], xmm2 #2.25
add r8, 8 #2.2
cmp r8, rsi #2.2
jb ..B3.12 # Prob 82% #2.2
Upvotes: 3