Why does gcc add this movss instruction only with _mm_set_ss?

Question

Consider these two functions using SSE:

#include <xmmintrin.h>

int ftrunc1(float f) {
    return _mm_cvttss_si32(_mm_set1_ps(f));
}

int ftrunc2(float f) {
    return _mm_cvttss_si32(_mm_set_ss(f));
}

Both are exactly the same in behaviour for any input. But the assembler output is different:

ftrunc1:
    pushl   %ebp
    movl    %esp, %ebp
    cvttss2si   8(%ebp), %eax
    leave
    ret

ftrunc2:
    pushl   %ebp
    movl    %esp, %ebp
    movss   8(%ebp), %xmm0
    cvttss2si   %xmm0, %eax
    leave
    ret

That is, ftrunc2 uses one extra movss instruction!

Is this normal? Does it matter? Should _mm_set1_ps always be preferred over _mm_set_ss when you only need to set the bottom element?

Compiler used was GCC 4.5.2 with -O3 -msse.

Mysticial · Accepted Answer

_mm_set_ss maps directly to an assembly instruction (movss), but _mm_set1_ps does not.

From what I've seen on GCC, MSVC, and ICC:

SSE intrinsics that map one-to-one to an assembly instruction are generally treated "as-is", as a black box. So the compiler will only apply optimizations that work on the instruction as a whole. It will not attempt any optimizations that require dataflow/dependency analysis on the individual vector elements.

The _mm_set1_ps and _mm_set_ps intrinsics do not map to a single instruction and have special case handling by most compilers. From what I've seen, all three of the compilers I've listed above do attempt to perform dataflow analysis optimizations on the individual elements.
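To make the "no single instruction" point concrete, here is a rough sketch of how a broadcast can be spelled out by hand with plain SSE: a scalar load followed by a shuffle. The helper names broadcast_by_hand and broadcast_matches_set1 are made up for illustration, and this is only one plausible expansion, not necessarily what any particular compiler emits for _mm_set1_ps.

```c
#include <xmmintrin.h>  /* SSE intrinsics */
#include <string.h>     /* memcmp */

/* Hypothetical helper: broadcast f to all four lanes "by hand",
   as a scalar load (movss) followed by a shuffle (shufps). */
static __m128 broadcast_by_hand(float f) {
    __m128 low = _mm_set_ss(f);  /* lanes: f, 0, 0, 0 */
    /* Select lane 0 for every output lane: f, f, f, f */
    return _mm_shuffle_ps(low, low, _MM_SHUFFLE(0, 0, 0, 0));
}

/* Hypothetical helper: returns 1 if the hand-written broadcast
   matches _mm_set1_ps bit for bit, 0 otherwise. */
static int broadcast_matches_set1(float f) {
    float a[4], b[4];
    _mm_storeu_ps(a, broadcast_by_hand(f));
    _mm_storeu_ps(b, _mm_set1_ps(f));
    return memcmp(a, b, sizeof a) == 0;
}
```

Because the broadcast is a composite of simpler steps, the compiler has to reason about individual lanes to implement it at all, which is presumably why it also gets to notice when some lanes are never used.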

When you put it all together, the second example leaves the movss because the compiler doesn't realize that the top 3 elements don't matter. (It makes no attempt to "open up" the _mm_set_ss intrinsic.)
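The "top 3 elements don't matter" part can be checked with a small sketch like the one below. The helper names (lanes, trunc_via_set1, trunc_via_set_ss) are hypothetical and exist only for this illustration. The two vectors differ in their upper lanes, but _mm_cvttss_si32 reads only the lowest lane, so both conversions should agree.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper: copy the four lanes of v into out[0..3]. */
static void lanes(__m128 v, float out[4]) {
    _mm_storeu_ps(out, v);
}

/* Same conversion as ftrunc1 in the question: broadcast, then truncate lane 0. */
static int trunc_via_set1(float f) {
    return _mm_cvttss_si32(_mm_set1_ps(f));
}

/* Same conversion as ftrunc2 in the question: set lane 0 only, then truncate lane 0. */
static int trunc_via_set_ss(float f) {
    return _mm_cvttss_si32(_mm_set_ss(f));
}
```

Calling lanes on _mm_set_ss(2.75f) gives 2.75 in lane 0 and zeros above it, while _mm_set1_ps(2.75f) gives 2.75 in every lane. Both truncation helpers return 2 for that input, since the upper lanes never reach cvttss2si.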
