orlp

Reputation: 117641

Why does gcc add this movss instruction only with _mm_set_ss?

Consider these two functions using SSE:

#include <xmmintrin.h>

int ftrunc1(float f) {
    return _mm_cvttss_si32(_mm_set1_ps(f));
}

int ftrunc2(float f) {
    return _mm_cvttss_si32(_mm_set_ss(f));
}

Both behave exactly the same for any input, but the generated assembly is different:

ftrunc1:
    pushl   %ebp
    movl    %esp, %ebp
    cvttss2si   8(%ebp), %eax
    leave
    ret

ftrunc2:
    pushl   %ebp
    movl    %esp, %ebp
    movss   8(%ebp), %xmm0
    cvttss2si   %xmm0, %eax
    leave
    ret

That is, ftrunc2 uses one extra movss instruction!

Is this normal? Does it matter? Should _mm_set1_ps always be preferred over _mm_set_ss when you only need to set the bottom element?


Compiler used was GCC 4.5.2 with -O3 -msse.
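
The claim that the two functions agree can be spot-checked with a sketch like the one below. It assumes an x86 target with SSE enabled (for example, built with -msse); the helper check_equivalent and the sample values are made up for illustration and do not prove equivalence for every input.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>

/* Broadcast f to all four lanes, then convert the lowest lane with truncation. */
int ftrunc1(float f) {
    return _mm_cvttss_si32(_mm_set1_ps(f));
}

/* Put f in the lowest lane (upper three lanes zeroed), then convert the lowest lane. */
int ftrunc2(float f) {
    return _mm_cvttss_si32(_mm_set_ss(f));
}

/* Hypothetical helper: asserts that both variants agree on a few sample inputs. */
void check_equivalent(void) {
    static const float samples[] = { 0.0f, 0.5f, -0.5f, 1.9f, -1.9f, 1234.75f, -1e6f };
    size_t i;
    for (i = 0; i < sizeof samples / sizeof samples[0]; ++i) {
        assert(ftrunc1(samples[i]) == ftrunc2(samples[i]));
    }
}
```

Calling check_equivalent() from a small main should pass silently, since only the lowest lane feeds the conversion in both variants.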

Upvotes: 5

Views: 1112

Answers (2)

Mysticial

Reputation: 471209

_mm_set_ss maps directly to an assembly instruction (movss). But _mm_set1_ps does not.

From what I've seen on GCC, MSVC, and ICC:

SSE intrinsics that map one-to-one to an assembly instruction are generally treated "as-is", as a black box. So the compiler will only perform optimizations that apply to the entire instruction itself. It will not attempt any optimizations that require dataflow/dependency analysis on the individual vector elements.

The _mm_set1_ps and _mm_set_ps intrinsics do not map to a single instruction and have special case handling by most compilers. From what I've seen, all three of the compilers I've listed above do attempt to perform dataflow analysis optimizations on the individual elements.


When you put it all together, the second example leaves the movss because the compiler doesn't realize that the top 3 elements don't matter. (It makes no attempt to "open up" the _mm_set_ss intrinsic.)

Upvotes: 5

Chris Dodd

Reputation: 126175

You're running into a quirk of the peephole optimizer. For some reason, in the first case it figures out that it can fold the mov into the cvttss2si, and in the second case it fails.

The question is, does it matter? The extra move instruction is almost free: it takes up an extra 4 bytes in the instruction stream and an extra decode slot, but both sequences require the same number of execution slots and the same number of load/store slots (which is what usually matters). The only potential sticking point is the 4 extra bytes of ifetch, but since ftrunc1 uses 10 bytes and ftrunc2 uses 14, both will fit in a single cache line, so you won't see any difference.

For minimizing that overhead, I'd be far more concerned about the unneeded %ebp cruft. (Are you compiling with -fno-omit-frame-pointer? I thought -O3 included -fomit-frame-pointer by default.) You'll do even better by inlining this function, which will likely completely change what the peephole optimizer sees, and so may make it work better in either case (or even reverse the cases where it works better). There's no way to tell without compiling larger programs and looking at the assembly code.

Bottom line, there's unlikely to be any measurable speed difference between the two...
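
To look at the inlined case, one could compile something like the sketch below with gcc -O3 -msse -S and compare the two loops in the generated assembly. The function names and the summing loop are invented for illustration, and an x86 target with SSE is assumed; what the compiler actually emits will depend on the GCC version and flags.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>

/* Inlinable variants of the two functions from the question. */
static inline int trunc_set1(float f)   { return _mm_cvttss_si32(_mm_set1_ps(f)); }
static inline int trunc_set_ss(float f) { return _mm_cvttss_si32(_mm_set_ss(f)); }

/* Sum of truncated values using the _mm_set1_ps variant. */
long sum_trunc_set1(const float *v, size_t n) {
    long total = 0;
    size_t i;
    for (i = 0; i < n; ++i)
        total += trunc_set1(v[i]);
    return total;
}

/* Same loop using the _mm_set_ss variant. */
long sum_trunc_set_ss(const float *v, size_t n) {
    long total = 0;
    size_t i;
    for (i = 0; i < n; ++i)
        total += trunc_set_ss(v[i]);
    return total;
}

/* Hypothetical sanity check: both loops must give the same total;
   only the generated instructions may differ. */
void check_sums_agree(const float *v, size_t n) {
    assert(sum_trunc_set1(v, n) == sum_trunc_set_ss(v, n));
}
```

The results are the same by construction, so the interesting part is only whether the extra movss survives in either loop body.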

Upvotes: 0
