haneefmubarak

Reputation: 2051

Why is the generated assembly reordered when using intrinsics?

I was playing around a bit with intrinsics, as I needed an O(1) function similar to memcmp() for a fixed input size. I ended up writing this:

#include <stdint.h>
#include <emmintrin.h>

int64_t f (int64_t a[4], int64_t b[4]) {
    __m128i *x = (void *) a, *y = (void *) b, r[2], t;
    int64_t *ret = (void *) &t;

    r[0] = _mm_xor_si128(x[0], y[0]);
    r[1] = _mm_xor_si128(x[1], y[1]);
    t = _mm_or_si128(r[0], r[1]);


    return (ret[0] | ret[1]);
}

which, when compiled, turns into this:

f:
    movdqa  xmm0, XMMWORD PTR [rdi]
    movdqa  xmm1, XMMWORD PTR [rdi+16]
    pxor    xmm0, XMMWORD PTR [rsi]
    pxor    xmm1, XMMWORD PTR [rsi+16]
    por     xmm0, xmm1
    movq    rdx, xmm0
    pextrq  rax, xmm0, 1
    or      rax, rdx
    ret

http://goo.gl/EtovJa (Godbolt Compiler Explorer)


After that, though, I became curious whether I really needed the intrinsic functions, or whether I only needed the types and could use the normal operators. I modified the above code (only the three SSE lines, really) and ended up with this:

#include <stdint.h>
#include <emmintrin.h>

int64_t f (int64_t a[4], int64_t b[4]) {
    __m128i *x = (void *) a, *y = (void *) b, r[2], t;
    int64_t *ret = (void *) &t;

    r[0] = x[0] ^ y[0];
    r[1] = x[1] ^ y[1];
    t = r[0] | r[1];


    return (ret[0] | ret[1]);
}

which instead compiles to this:

f:
    movdqa  xmm0, XMMWORD PTR [rdi+16]
    movdqa  xmm1, XMMWORD PTR [rdi]
    pxor    xmm0, XMMWORD PTR [rsi+16]
    pxor    xmm1, XMMWORD PTR [rsi]
    por     xmm0, xmm1
    movq    rdx, xmm0
    pextrq  rax, xmm0, 1
    or      rax, rdx
    ret

http://goo.gl/oDHF3z (Godbolt Compiler Explorer)


As far as I can tell, the two assembly outputs are functionally identical: they differ only in which 16-byte half ends up in xmm0 and which in xmm1, so I would expect them to take the same time and resources. However, I am curious why the operands in the first four instructions have been swapped. Is there some particular reason why one ordering might be chosen over the other?

Note: Both of the functions were compiled with GCC, with identical flags.
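
To make the "functionally identical" claim concrete, here is a scalar model of the two listings. It is only a sketch under my reading of the assembly; the names model_first, model_second and check_models are hypothetical and appear nowhere in the original code.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of the first listing: xmm0 holds the low 16 bytes,
   xmm1 the high 16 bytes. */
int64_t model_first(const int64_t a[4], const int64_t b[4]) {
    int64_t x0_lo = a[0] ^ b[0], x0_hi = a[1] ^ b[1]; /* xmm0 = [rdi]    ^ [rsi]    */
    int64_t x1_lo = a[2] ^ b[2], x1_hi = a[3] ^ b[3]; /* xmm1 = [rdi+16] ^ [rsi+16] */
    x0_lo |= x1_lo;                                   /* por xmm0, xmm1 */
    x0_hi |= x1_hi;
    return x0_hi | x0_lo;                             /* or rax, rdx */
}

/* Scalar model of the second listing: the two halves swap registers. */
int64_t model_second(const int64_t a[4], const int64_t b[4]) {
    int64_t x0_lo = a[2] ^ b[2], x0_hi = a[3] ^ b[3]; /* xmm0 = [rdi+16] ^ [rsi+16] */
    int64_t x1_lo = a[0] ^ b[0], x1_hi = a[1] ^ b[1]; /* xmm1 = [rdi]    ^ [rsi]    */
    x0_lo |= x1_lo;                                   /* por xmm0, xmm1 */
    x0_hi |= x1_hi;
    return x0_hi | x0_lo;                             /* or rax, rdx */
}

/* Hypothetical helper: the two orderings must agree for any input pair. */
void check_models(const int64_t a[4], const int64_t b[4]) {
    assert(model_first(a, b) == model_second(a, b));
}
```

Since OR is commutative and the two XOR results are independent of each other, swapping which half lands in which register cannot change the returned value.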

Upvotes: 5

Views: 558

Answers (1)

filcab

Reputation: 1984

TL;DR: From the compiler's point of view, the two inputs are different code. They can take different paths through the compiler and hit different checks along the way, which can make the output differ.

You won't see this in (a current) clang, since the intrinsics disappear by the time you get to IR (the intermediate representation of your code that LLVM uses). The IR is what eventually gets transformed into instructions, and the IR for both cases is the same.

If you compile that code with clang or with different versions of gcc, you'll see slight changes in the instruction scheduling. These are usually due to changes in the compiler's instruction scheduler or register allocator from version to version.

Try this out with the two functions you provided in the same file, using different versions of gcc and different versions of clang. Across versions, clang only changes the ordering of the movd instruction, and it always emits both functions with the same instructions, since the LLVM backend gets the same IR for both cases.
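
As a starting point, a single file with both variants might look like the sketch below. It is adapted from the question rather than copied: the names f_intrinsics, f_operators, fold_halves and check_agree are hypothetical, the memcpy readout is my substitution for the pointer cast in the original, and both functions assume 16-byte-aligned inputs because they dereference __m128i pointers directly.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <emmintrin.h>

/* OR the two 64-bit halves of a vector together. memcpy is used here
   instead of reading the vector through an int64_t pointer. */
static int64_t fold_halves(__m128i t) {
    int64_t half[2];
    memcpy(half, &t, sizeof half);
    return half[0] | half[1];
}

/* Variant 1: explicit SSE2 intrinsics.
   Assumes a and b are 16-byte aligned. */
int64_t f_intrinsics(const int64_t a[4], const int64_t b[4]) {
    const __m128i *x = (const void *) a, *y = (const void *) b;
    __m128i r0 = _mm_xor_si128(x[0], y[0]);
    __m128i r1 = _mm_xor_si128(x[1], y[1]);
    return fold_halves(_mm_or_si128(r0, r1));
}

/* Variant 2: the same computation with the GCC/Clang vector-extension
   operators on the same type. Same alignment assumption. */
int64_t f_operators(const int64_t a[4], const int64_t b[4]) {
    const __m128i *x = (const void *) a, *y = (const void *) b;
    __m128i r0 = x[0] ^ y[0];
    __m128i r1 = x[1] ^ y[1];
    return fold_halves(r0 | r1);
}

/* Hypothetical helper: both variants must return the same value. */
void check_agree(const int64_t a[4], const int64_t b[4]) {
    assert(f_intrinsics(a, b) == f_operators(a, b));
}
```

Pasting this into Compiler Explorer and switching compilers and versions should show the two bodies side by side in one listing, which makes any difference in load order easy to spot.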

I don't know the internals of GCC, but I suppose the two functions happen not to hit exactly the same places in the scheduler's code, and so end up emitting the loads in a different order. This could happen because, in one case, the intrinsics are not lowered to the same intermediate representation as the operators and instead stay as intrinsic calls (not function calls).

Upvotes: 3
