Reputation: 2051
I was playing around a bit with intrinsics, as I needed an O(1) complexity function similar to memcmp() for a fixed input size. I ended up writing this:
#include <stdint.h>
#include <emmintrin.h>
int64_t f (int64_t a[4], int64_t b[4]) {
    __m128i *x = (void *) a, *y = (void *) b, r[2], t;
    int64_t *ret = (void *) &t;
    r[0] = _mm_xor_si128(x[0], y[0]);
    r[1] = _mm_xor_si128(x[1], y[1]);
    t = _mm_or_si128(r[0], r[1]);
    return (ret[0] | ret[1]);
}
which, when compiled, turns into this:
f:
movdqa xmm0, XMMWORD PTR [rdi]
movdqa xmm1, XMMWORD PTR [rdi+16]
pxor xmm0, XMMWORD PTR [rsi]
pxor xmm1, XMMWORD PTR [rsi+16]
por xmm0, xmm1
movq rdx, xmm0
pextrq rax, xmm0, 1
or rax, rdx
ret
http://goo.gl/EtovJa (Godbolt Compiler Explorer)
After that though, I became curious as to whether I really needed to use intrinsic functions or whether I only needed the types and I could just use normal operators. I then modified the above code (only the three SSE lines, really) and ended up with this:
#include <stdint.h>
#include <emmintrin.h>
int64_t f (int64_t a[4], int64_t b[4]) {
    __m128i *x = (void *) a, *y = (void *) b, r[2], t;
    int64_t *ret = (void *) &t;
    r[0] = x[0] ^ y[0];
    r[1] = x[1] ^ y[1];
    t = r[0] | r[1];
    return (ret[0] | ret[1]);
}
which instead compiles to this:
f:
movdqa xmm0, XMMWORD PTR [rdi+16]
movdqa xmm1, XMMWORD PTR [rdi]
pxor xmm0, XMMWORD PTR [rsi+16]
pxor xmm1, XMMWORD PTR [rsi]
por xmm0, xmm1
movq rdx, xmm0
pextrq rax, xmm0, 1
or rax, rdx
ret
http://goo.gl/oDHF3z (Godbolt Compiler Explorer)
Functionally (AFAICT), the two assembly outputs are identical, and I would expect them to take the same time and resources to execute. However, I am curious why the first four instructions are ordered differently: the second version handles the upper half ([rdi+16], [rsi+16]) first, with the roles of xmm0 and xmm1 swapped. Is there a particular reason one ordering might be chosen over the other?
Note: Both of the functions were compiled with GCC, with identical flags.
Upvotes: 5
Views: 558
Reputation: 1984
TL;DR: From the compiler's point of view, the two inputs are different code. They can take different paths through the compiler and hit different checks along the way, which can make the output differ.
You won't see this in (a current) clang: the intrinsics are gone by the time the code reaches IR (the intermediate representation that LLVM uses), so the IR is the same for both cases and is transformed into the same instructions.
If you check out that code with clang or with different versions of gcc, you'll see slight changes in the instruction scheduling. These are usually due to changes in the compiler's instruction scheduler or register allocator from version to version.
Try this out with the two functions you provided in the same file, using different versions of gcc and different versions of clang. Clang only changes the ordering of the movd instruction, and it always emits the same instructions for both functions, since the LLVM backend gets the same IR for both cases.
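As a rough sketch of that experiment, here are both variants side by side in one translation unit. This is not the exact code from the question: the names f_intrin, f_ops and fold128 are made up for illustration, and I have assumed unaligned loads (_mm_loadu_si128) and a store-based extraction so the functions are safe to call on ordinary arrays. The only intended difference between the two is intrinsics versus plain operators for the XOR/OR step.

```c
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: OR the two 64-bit halves of a 128-bit vector. */
static int64_t fold128(__m128i v) {
    int64_t out[2];
    _mm_storeu_si128((__m128i *)out, v);  /* unaligned store, no alignment assumed */
    return out[0] | out[1];
}

/* Variant 1: XOR/OR written with intrinsics. Returns 0 iff a and b are equal. */
int64_t f_intrin(const int64_t a[4], const int64_t b[4]) {
    __m128i x0 = _mm_loadu_si128((const __m128i *)&a[0]);
    __m128i x1 = _mm_loadu_si128((const __m128i *)&a[2]);
    __m128i y0 = _mm_loadu_si128((const __m128i *)&b[0]);
    __m128i y1 = _mm_loadu_si128((const __m128i *)&b[2]);
    __m128i r0 = _mm_xor_si128(x0, y0);
    __m128i r1 = _mm_xor_si128(x1, y1);
    return fold128(_mm_or_si128(r0, r1));
}

/* Variant 2: same loads, but XOR/OR written with plain operators
   (a GCC/clang extension on vector types). */
int64_t f_ops(const int64_t a[4], const int64_t b[4]) {
    __m128i x0 = _mm_loadu_si128((const __m128i *)&a[0]);
    __m128i x1 = _mm_loadu_si128((const __m128i *)&a[2]);
    __m128i y0 = _mm_loadu_si128((const __m128i *)&b[0]);
    __m128i y1 = _mm_loadu_si128((const __m128i *)&b[2]);
    return fold128((x0 ^ y0) | (x1 ^ y1));
}
```

Compiling this with something like gcc -O2 -S -masm=intel (and the same with clang) lets you diff the two function bodies directly. Because the loads differ from the question's code, the exact instruction order you see may not match the listings above.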
I don't know the internals of GCC, but I suppose the two functions happen not to take exactly the same path through the scheduler and so end up with the loads emitted in a different order. This could happen because, in one case, the intrinsic calls might not be lowered to an intermediate representation and instead stay as intrinsic (not function) calls.
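For context on why the operator form compiles at all: as far as I know, the GCC and clang headers declare __m128i as a vector type using the vector_size attribute, and ^ and | on such types are a compiler extension. The sketch below uses that extension directly, without emmintrin.h. The typedef v2i64 and the function f_vec are hypothetical names, and memcpy is my own choice to sidestep alignment and aliasing concerns, not something from the question.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical name: a 16-byte vector of two int64_t lanes (GCC/clang extension). */
typedef int64_t v2i64 __attribute__((vector_size(16)));

/* Returns 0 iff the 32 bytes at a and b are equal. */
int64_t f_vec(const int64_t a[4], const int64_t b[4]) {
    v2i64 x0, x1, y0, y1;
    /* memcpy avoids assuming 16-byte alignment of a and b. */
    memcpy(&x0, &a[0], sizeof x0);
    memcpy(&x1, &a[2], sizeof x1);
    memcpy(&y0, &b[0], sizeof y0);
    memcpy(&y1, &b[2], sizeof y1);
    /* Element-wise XOR and OR on whole vectors. */
    v2i64 t = (x0 ^ y0) | (x1 ^ y1);
    /* Vector subscripting is also part of the extension. */
    return t[0] | t[1];
}
```

Whether this produces the same instruction order as either version in the question depends on the compiler and version, which is the point of the answer above.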
Upvotes: 3