Reputation: 117641
Consider these two functions using SSE:
#include <xmmintrin.h>
int ftrunc1(float f) {
    return _mm_cvttss_si32(_mm_set1_ps(f));
}
int ftrunc2(float f) {
    return _mm_cvttss_si32(_mm_set_ss(f));
}
Both are exactly the same in behaviour for any input. But the assembler output is different:
ftrunc1:
    pushl %ebp
    movl %esp, %ebp
    cvttss2si 8(%ebp), %eax
    leave
    ret
ftrunc2:
    pushl %ebp
    movl %esp, %ebp
    movss 8(%ebp), %xmm0
    cvttss2si %xmm0, %eax
    leave
    ret
That is, ftrunc2 uses one extra movss instruction!
Is this normal? Does it matter? Should _mm_set1_ps always be preferred over _mm_set_ss when you only need to set the bottom element?
Compiler used was GCC 4.5.2 with -O3 -msse.
Upvotes: 5
Views: 1112
Reputation: 471209
_mm_set_ss maps directly to an assembly instruction (movss). But _mm_set1_ps does not.
From what I've seen on GCC, MSVC, and ICC:
SSE intrinsics that map one-to-one to an assembly instruction are generally treated "as-is", as a black box. So the compiler will only perform optimizations that apply to the entire instruction itself. It will not attempt any optimizations that require dataflow/dependency analysis on the individual vector elements.
The _mm_set1_ps and _mm_set_ps intrinsics do not map to a single instruction and have special case handling by most compilers. From what I've seen, all three of the compilers I've listed above do attempt to perform dataflow analysis optimizations on the individual elements.
When you put it all together, the second example leaves the movss because the compiler doesn't realize that the top 3 elements don't matter. (It makes no attempt to "open up" the _mm_set_ss intrinsic.)
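To make the "top 3 elements don't matter" point concrete, here is a minimal sketch (not from the original posts; the helper names lanes and same_result are made up for illustration, and it assumes an x86 target with SSE enabled). It stores each vector to memory so you can see that _mm_set1_ps fills all four lanes while _mm_set_ss zeroes the upper three, and that _mm_cvttss_si32 reads only the bottom lane, so both functions return the same value.

```c
#include <xmmintrin.h>

/* Hypothetical helper: copy the four lanes of v into out[0..3].
   out[0] is the bottom element, the only one cvttss2si reads. */
void lanes(__m128 v, float out[4]) {
    _mm_storeu_ps(out, v);
}

/* The two functions from the question, unchanged in behaviour. */
int ftrunc1(float f) {
    return _mm_cvttss_si32(_mm_set1_ps(f));  /* lanes: f, f, f, f */
}

int ftrunc2(float f) {
    return _mm_cvttss_si32(_mm_set_ss(f));   /* lanes: f, 0, 0, 0 */
}

/* Hypothetical helper: returns 1 if both versions agree on f, else 0. */
int same_result(float f) {
    return ftrunc1(f) == ftrunc2(f);
}
```

Calling lanes() on _mm_set1_ps(2.75f) and on _mm_set_ss(2.75f) shows the differing upper lanes, while same_result(2.75f) returns 1 because only lane 0 feeds the conversion.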
Upvotes: 5
Reputation: 126175
You're running into a quirk of the peephole optimizer. For some reason, in the first case it figures out that it can fold the mov into the cvttss2si, and in the second case it fails.
The question is, does it matter? The extra move instruction is almost free: it takes up an extra 4 bytes in the instruction stream and an extra decode slot, but both sequences require the same number of execution slots and the same number of load/store slots (which is what usually matters). The only potential sticking point is the 4 extra bytes of instruction fetch, but since ftrunc1 uses 10 bytes and ftrunc2 uses 14, both will fit in a single cache line, so you won't see any difference.
For minimizing that overhead, I'd be far more concerned about the unneeded %ebp cruft. (Are you compiling with -fno-omit-frame-pointer? I thought -O3 included -fomit-frame-pointer by default.) You'll do even better by inlining this function, which will likely completely change what the peephole optimizer sees, and so may make it work better in either case (or even reverse which case works better). There's no way to tell without compiling larger programs and looking at the assembly code.
Bottom line, there's unlikely to be any measurable speed difference between the two...
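If you want to see what inlining does here, a small sketch along these lines may help. The names ftrunc_inl and sum_truncated are hypothetical, not from the question. Compile it with something like gcc -O3 -msse -S and look at the loop body; whether the movss gets folded into the cvttss2si there depends on the compiler and version, so treat the output as something to check rather than a prediction.

```c
#include <stddef.h>
#include <xmmintrin.h>

/* Same body as ftrunc2 from the question, marked static inline so the
   compiler is free to expand it at the call site. */
static inline int ftrunc_inl(float f) {
    return _mm_cvttss_si32(_mm_set_ss(f));
}

/* Hypothetical caller: sums the truncated values of v[0..n-1].
   The optimizer now sees the conversion inside a loop rather than
   as a standalone function with its own prologue and epilogue. */
long sum_truncated(const float *v, size_t n) {
    long total = 0;
    size_t i;
    for (i = 0; i < n; ++i) {
        total += ftrunc_inl(v[i]);
    }
    return total;
}
```

Swapping _mm_set_ss for _mm_set1_ps in ftrunc_inl and diffing the two assembly listings is one way to check whether the difference from the question survives inlining.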
Upvotes: 0