Vincent

Reputation: 60421

Strange optimization result with addcarry on gcc

Consider the following code that benchmarks the addcarry intrinsic:

// Preamble
#include <iostream>

// Addcarry wrapper
template <class C, class T>
C addcarry(C carry, T src0, T src1, T* dst)
{
    unsigned long long int d = 0;
    carry = __builtin_ia32_addcarryx_u64(carry, src0, src1, &d);
    *dst = d;
    return carry;
}

// Main function
int main(int argc, char* argv[])
{
    // Initialization
    unsigned long long int n = argc > 1 ? std::stoull(argv[1]) : 1ULL << 31;
    unsigned char carry = 0;
    unsigned long long int src1 = 0;
#ifdef NOSTOULL
    unsigned long long int dst = 0;
#else
    unsigned long long int dst = std::stoull("0");
#endif

    // Computation
    for (unsigned long long int src0 = 0; src0 < n; ++src0) {
        src1 = dst;
#ifdef NOWRAPPER
        carry = __builtin_ia32_addcarryx_u64(carry, src0, src1, &dst);
#else
        carry = addcarry(carry, src0, src1, &dst);
#endif
    }

    // Finalization
    return dst + carry;
}

I compile with the following command (where [] denote options):

[g++6.3.0/g++7.1.0] -Wall -Wextra -pedantic -O3 -g [-DNOSTOULL]
[-DNOWRAPPER] addcarry_loop.cpp -o addcarry_loop -madx

And then I execute with:

time ./addcarry_loop

And I obtain the following real/user times (over several runs):

(1) [-DNOSTOULL] [-DNOWRAPPER] => ~0m2.64s
(2) [          ] [-DNOWRAPPER] => ~0m2.61s
(3) [-DNOSTOULL] [           ] => ~0m2.48s
(4) [          ] [           ] => ~0m1.86s

We can note that case (4), with neither option defined, i.e. with both the stoull initialization and the wrapper, is noticeably faster than the other three.

How can these results be explained, and in particular the last one, which makes no sense to me? How can initializing a variable with stoull and wrapping the intrinsic in a function make the code faster?

Note: experiments with other compilers/other architectures are welcome (adx instruction set needed).

EDIT: Following the comments, I updated the code, extracting the loop from the main function:

// Preamble
#include <iostream>

// Addcarry wrapper
template <class C, class T>
C addcarry(C carry, T src0, T src1, T* dst)
{
    unsigned long long int d = 0;
    carry = __builtin_ia32_addcarryx_u64(carry, src0, src1, &d);
    *dst = d;
    return carry;
}

// Compute
unsigned long long int compute(unsigned long long int n)
{
    // Initialization
    unsigned char carry = 0;
    unsigned long long int src1 = 0;
#ifdef NOSTOULL
    unsigned long long int dst = 0;
#else
    unsigned long long int dst = std::stoull("0");
#endif

    // Computation
    for (unsigned long long int src0 = 0; src0 < n; ++src0) {
        src1 = dst;
#ifdef NOWRAPPER
        carry = __builtin_ia32_addcarryx_u64(carry, src0, src1, &dst);
#else
        carry = addcarry(carry, src0, src1, &dst);
#endif
    }

    // Finalization
    return dst + carry;
}

// Main function
int main(int argc, char* argv[])
{
    return compute(argc > 1 ? std::stoull(argv[1]) : 1ULL << 31);
}

The assembly corresponding to the loop seems slightly different. In the following image, the left side corresponds to the empty options [] [] while the right side corresponds to [] [-DNOWRAPPER] (the left side is the "fast" one, the right side the "slow" one).

Valgrind results

Upvotes: 1

Views: 310

Answers (1)

geza

Reputation: 29960

I cannot reproduce your results: I've tried GCC 6.3.0 and GCC 7.1.0, and here all your variants run at the same speed. However, for some reason, your disassembly differs from mine, and yours contains some strange things. For example, on the left side, at 0x400d7d, there's an unnecessary memory move: the store could be sunk to after the loop. Certainly, a good programmer could write better code here (the best "code" in this case is to remove the loop completely and apply the closed-form mathematical formula for the sum).

In my opinion, compilers are still not good enough. They become better and better (great work, compiler developers!), but sometimes they generate code that is far from optimal.

Here's my latest experience: I wrote a Huffman decoder. Clang generated code that ran at almost half the speed of GCC's. That's a big difference. I checked the disassembly, and Clang tried to "merge" two 32-bit variables into one 64-bit register (maybe it tried too hard to avoid using the stack?). I modified the code a little, and the Clang version suddenly became slightly faster than GCC's.

With a tight loop like yours, every little detail matters. Modifying the code slightly can cause a huge speed difference. And the same compiled code may even behave differently on a different processor (Intel/AMD).

My suggestion is this: when writing a performance-sensitive loop, put the loop into a separate function. This function should not contain anything beyond the loop itself, so that its prologue and epilogue stay trivial. That way you help the compiler optimize better (use registers as efficiently as possible). My Huffman decoder became 20% faster with this technique. I'm not saying you should stick to this rule 100% of the time, but it usually works for me.

Upvotes: 1
