Reputation: 60421
Consider the following code that benchmarks the addcarry
intrinsic:
// Preamble
#include <iostream>

// Addcarry wrapper
template <class C, class T>
C addcarry(C carry, T src0, T src1, T* dst)
{
    unsigned long long int d = 0;
    carry = __builtin_ia32_addcarryx_u64(carry, src0, src1, &d);
    *dst = d;
    return carry;
}
// Main function
int main(int argc, char* argv[])
{
    // Initialization
    unsigned long long int n = argc > 1 ? std::stoull(argv[1]) : 1ULL << 31;
    unsigned char carry = 0;
    unsigned long long int src1 = 0;
#ifdef NOSTOULL
    unsigned long long int dst = 0;
#else
    unsigned long long int dst = std::stoull("0");
#endif
    // Computation
    for (unsigned long long int src0 = 0; src0 < n; ++src0) {
        src1 = dst;
#ifdef NOWRAPPER
        carry = __builtin_ia32_addcarryx_u64(carry, src0, src1, &dst);
#else
        carry = addcarry(carry, src0, src1, &dst);
#endif
    }
    // Finalization
    return dst + carry;
}
I compile with the following command (where []
denote options):
[g++6.3.0/g++7.1.0] -Wall -Wextra -pedantic -O3 -g [-DNOSTOULL]
[-DNOWRAPPER] addcarry_loop.cpp -o addcarry_loop -madx
And then I execute with:
time ./addcarry_loop
And I obtain the following real/user times (over several runs):
(1) [-DNOSTOULL] [-DNOWRAPPER] => ~0m2.64s
(2) [ ] [-DNOWRAPPER] => ~0m2.61s
(3) [-DNOSTOULL] [ ] => ~0m2.48s
(4) [ ] [ ] => ~0m1.86s
We can note that variant (4), the one with neither option defined (so with both the std::stoull initialization and the wrapper), is clearly the fastest.
How can these results be explained, and in particular the last one, which makes no sense to me? How can initializing a variable with stoull and wrapping an intrinsic in a function make the code faster?
Note: experiments with other compilers/other architectures are welcome (the adx
instruction set is needed).
EDIT: Given the comments, I updated the code, extracting the loop into a separate function:
// Preamble
#include <iostream>

// Addcarry wrapper
template <class C, class T>
C addcarry(C carry, T src0, T src1, T* dst)
{
    unsigned long long int d = 0;
    carry = __builtin_ia32_addcarryx_u64(carry, src0, src1, &d);
    *dst = d;
    return carry;
}

// Compute
unsigned long long int compute(unsigned long long int n)
{
    // Initialization
    unsigned char carry = 0;
    unsigned long long int src1 = 0;
#ifdef NOSTOULL
    unsigned long long int dst = 0;
#else
    unsigned long long int dst = std::stoull("0");
#endif
    // Computation
    for (unsigned long long int src0 = 0; src0 < n; ++src0) {
        src1 = dst;
#ifdef NOWRAPPER
        carry = __builtin_ia32_addcarryx_u64(carry, src0, src1, &dst);
#else
        carry = addcarry(carry, src0, src1, &dst);
#endif
    }
    // Finalization
    return dst + carry;
}

// Main function
int main(int argc, char* argv[])
{
    return compute(argc > 1 ? std::stoull(argv[1]) : 1ULL << 31);
}
The assembly corresponding to the loop seems a little different. In the following image, the left side corresponds to the empty options [] []
while the right side corresponds to [] [-DNOWRAPPER]
(the left side is the "fast" one, the right side is the "slow" one).
Upvotes: 1
Views: 310
Reputation: 29960
I cannot reproduce your results; I've tried GCC 6.3.0 and GCC 7.1.0. Here, all your variants run at the same speed. However, for some reason, your disassembly differs from mine. Looking at your disassembly, it has some strange things. For example, on the left side, at 0x400d7d, there's an unnecessary memory move: it could be hoisted out to after the loop. Certainly, a good programmer could write better code here (the best code in this case removes the loop completely and applies the mathematical formula for the sum).
In my opinion, compilers are still not good enough. They become better and better (great work, compiler developers!), but sometimes they generate code far from optimal.
Here's my latest experience: I wrote a Huffman decoder. Clang generated code which ran at almost half the speed of GCC's. That's a big difference. I checked the disassembly, and Clang had tried to "merge" two 32-bit variables into a 64-bit register (maybe it tried too hard to avoid using the stack?). I modified the code a little bit, and then Clang's code suddenly became slightly faster than GCC's.
With a loop as small as yours, every little detail matters. Modifying the code slightly can cause a huge speed difference. And the same compiled code may behave differently on a different processor (Intel/AMD).
My suggestion is this: when writing a performance-sensitive loop, try to put the loop into a separate function. That function should contain essentially nothing but the loop, so it has no significant prologue or epilogue code. This way you help the compiler optimize better (to use registers as efficiently as possible). My Huffman decoder became 20% faster using this technique. I'm not saying you should stick to this rule 100% of the time, but it usually works for me.
Upvotes: 1