ashwin

Reputation: 59

AVX instructions generated when -xSSE4.1 specified

I compiled a piece of code with the option -xSSE4.1 using the Intel compiler. When I looked at the generated assembly file, I saw that AVX instructions such as 'vpmovzxbw' had been inserted. However, the executable still seems to run on machines that don't support the AVX instruction set. What explains this?

Here's the particular code snippet -

C -> src0_8x16b  = _mm_cvtepu8_epi16 (src0_8x16b);

Assembly -> vpmovzxbw xmm4, QWORD PTR [rcx]

Binary -> 00066 c4 62 79 30 29   

Here's another snippet where the assembly instruction uses 3 operands -

C -> src0_8x16b = _mm_sub_epi16 (src0_8x16b, src1_8x16b);

Assembly -> vpsubw xmm1, xmm13, xmm11              

Binary -> 000bc c4 c1 11 f9 cb   
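
One way to confirm from the binary listing that these are VEX-encoded (AVX) forms rather than legacy SSE encodings is to look at the first byte: in 64-bit code, 0xC4 and 0xC5 introduce a VEX prefix. The following is only a minimal sketch of that check. The function name starts_with_vex is hypothetical, and it assumes 64-bit code with the buffer starting at the instruction itself (no segment or address-size prefix in front).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns true if the instruction bytes begin with a VEX prefix:
 * 0xC5 is the 2-byte form, 0xC4 the 3-byte form.
 * Assumption: 64-bit code only. In 32-bit code the same bytes can
 * also decode as LDS/LES, so this test alone is not sufficient there. */
static bool starts_with_vex(const uint8_t *code, size_t len)
{
    return len >= 2 && (code[0] == 0xC4 || code[0] == 0xC5);
}
```

Both byte sequences listed above start with c4, so they pass this check, whereas a legacy SSE4.1 encoding of pmovzxbw starts with 66 0f 38 30 and does not.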

For comparison, here's the disassembly generated by icc for the function 'foo' (The only difference between the function foo and the code snippet above is that the code snippet was coded using intrinsics) -

Compiler commands used - 
icc -S -xSSE4.1 -axavx -O3 foo.c

Function foo -
void foo(float *x, int n) 
{
    int i;

    for(i=0; i<n; i++) x[i] *= 2.0;
}

Autodispatch code - 
testl     $-131072, __intel_cpu_indicator(%rip)         #1.27
jne       foo.R                                         #1.27
testl     $-1, __intel_cpu_indicator(%rip)              #1.27
jne       foo.A

Loop in foo.R (AVX variant) - 
vmulps    (%rdi,%rcx,4), %ymm0, %ymm1                   #3.24
vmulps    32(%rdi,%rcx,4), %ymm0, %ymm2                 #3.24
vmovups   %ymm1, (%rdi,%rcx,4)                          #3.24
vmovups   %ymm2, 32(%rdi,%rcx,4)                        #3.24
addq      $16, %rcx                                     #3.5
cmpq      %rdx, %rcx                                    #3.5
jb        ..B2.12       # Prob 82%                      #3.5

Loop in foo.A (SSE variant) - 
movaps    (%rdi,%r8,4), %xmm1                           #3.24
movaps    16(%rdi,%r8,4), %xmm2                         #3.24
mulps     %xmm0, %xmm1                                  #3.24
mulps     %xmm0, %xmm2                                  #3.24
movaps    %xmm1, (%rdi,%r8,4)                           #3.24
movaps    %xmm2, 16(%rdi,%r8,4)                         #3.24
addq      $8, %r8                                       #3.5
cmpq      %rsi, %r8                                     #3.5
jb        ..B3.12       # Prob 82%                      #3.5

Upvotes: 1

Views: 521

Answers (2)

ashwin

Reputation: 59

I have tried to replicate the results on two other compilers, viz., gcc and Microsoft Visual Studio's v100 compilers. I was unable to do so, i.e., gcc and v100 compilers seem to be generating the correct disassemblies. As a further step, I looked closely at the differences, if any, that existed between the compiler arguments that I had specified in each case. It turns out that whilst using the icc compiler, I had enabled the option to inherit project defaults for compiling this particular file. The project settings were configured such that this option was included -

-xavx

As a result when this file was being compiled, the settings I had provided -

-xSSE4.1 -axavx

were overridden by the former. This was the cause of the behavior I have detailed in my question.

I am sorry for this error, but I shall not delete this question since @Zboson 's answer is exceptional.

PS - I had mentioned in one of my comments that I was able to run this code on an SSE4.2 machine. That was because the exe I had run on that machine was indeed SSE4.1 compliant: I had apparently used an exe generated with the gcc compiler. When I ran the icc-generated exe on the SSE4.2 machine, it did crash with an illegal instruction error.

Upvotes: 3

Z boson

Reputation: 33679

The Intel compiler can generate a single executable with multiple levels of vectorization with the -ax flag.

For example, to generate code that is compatible with AVX, SSE4.2 and SSE2, use -axAVX -axSSE4.2 -xSSE2.

Since you compiled with -axAVX -xSSE4.1, the Intel compiler generated an AVX branch and an SSE4.1 branch; at runtime it determines which instruction set is available and chooses the matching branch.

Agner Fog has a good description of Intel's CPU dispatcher in his Optimizing C++ manual. See section "13.7 CPU dispatching in Intel compiler". Intel's CPU dispatcher is not ideal for several reasons, one of which is that it plays badly on AMD processors, which Agner describes in detail. Personally, I would make my own dispatcher.
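
As a rough idea of what a hand-written dispatcher could look like, here is a minimal sketch. It is not what icc generates: it assumes GCC or Clang on x86, using their __builtin_cpu_init / __builtin_cpu_supports builtins and the target function attribute. The names scale, scale_avx, scale_baseline and pick_scale are made up for illustration.

```c
#include <stddef.h>

/* Baseline version: built with the default target flags. */
static void scale_baseline(float *x, int n)
{
    for (int i = 0; i < n; i++) x[i] *= 2.0f;
}

/* AVX version: the target attribute (GCC/Clang extension) allows
 * AVX instructions in this one function only. */
__attribute__((target("avx")))
static void scale_avx(float *x, int n)
{
    for (int i = 0; i < n; i++) x[i] *= 2.0f;
}

typedef void (*scale_fn)(float *, int);

/* Query the CPU once and return the best available variant. */
static scale_fn pick_scale(void)
{
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx") ? scale_avx : scale_baseline;
}

/* Public entry point: resolves the variant on first call.
 * Note: this lazy initialization is not synchronized between threads. */
void scale(float *x, int n)
{
    static scale_fn impl = NULL;
    if (!impl) impl = pick_scale();
    impl(x, n);
}
```

Callers just use scale(x, n). Unlike Intel's dispatcher, the choice here depends only on the feature flag, not on the CPU vendor.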


I compiled the following code with ICC 13.0 with options -O3 -axavx -xsse2

void foo(float *x, int n) {
    for(int i=0; i<n; i++) x[i] *= 2.0;
}

and the start of the assembly is

    test      DWORD PTR __intel_cpu_indicator[rip], -131072 #1.27
    jne       _Z3fooPfi.R                                   #1.27
    test      DWORD PTR __intel_cpu_indicator[rip], -1      #1.27
    jne       _Z3fooPfi.A 

Going to the _Z3fooPfi.R branch, we find the main AVX loop

..B2.12:                        # Preds ..B2.12 ..B2.11
vmulps    ymm1, ymm0, YMMWORD PTR [rdi+rcx*4]           #2.25
vmulps    ymm2, ymm0, YMMWORD PTR [32+rdi+rcx*4]        #2.25
vmovups   YMMWORD PTR [rdi+rcx*4], ymm1                 #2.25
vmovups   YMMWORD PTR [32+rdi+rcx*4], ymm2              #2.25
add       rcx, 16                                       #2.2
cmp       rcx, rdx                                      #2.2
jb        ..B2.12       # Prob 82%                      #2.2

The _Z3fooPfi.A branch has the main SSE loop

movaps    xmm1, XMMWORD PTR [rdi+r8*4]                  #2.25
movaps    xmm2, XMMWORD PTR [16+rdi+r8*4]               #2.25
mulps     xmm1, xmm0                                    #2.25
mulps     xmm2, xmm0                                    #2.25
movaps    XMMWORD PTR [rdi+r8*4], xmm1                  #2.25
movaps    XMMWORD PTR [16+rdi+r8*4], xmm2               #2.25
add       r8, 8                                         #2.2
cmp       r8, rsi                                       #2.2
jb        ..B3.12       # Prob 82%                      #2.2
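
For readers less used to assembly, the two test/jne pairs at the start of the function can be read roughly as the C below. This is only a model of the listing: the parameter indicator stands in for __intel_cpu_indicator, the enum and function names are invented, and what happens when neither branch is taken is not shown in the excerpt, so it is left as a separate case.

```c
#include <stdint.h>

enum variant { VARIANT_AVX, VARIANT_BASELINE, VARIANT_NOT_SET };

/* Models the dispatch test at the top of foo. */
static enum variant choose_variant(uint32_t indicator)
{
    /* test indicator, -131072: as a 32-bit mask, -131072 is 0xFFFE0000 */
    if (indicator & 0xFFFE0000u)
        return VARIANT_AVX;        /* jne _Z3fooPfi.R */

    /* test indicator, -1: true if any bit at all is set */
    if (indicator != 0)
        return VARIANT_BASELINE;   /* jne _Z3fooPfi.A */

    /* Fall-through path; not part of the listing above. */
    return VARIANT_NOT_SET;
}
```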

Upvotes: 3
