Reputation: 54041
When calling the bound method this->UpdateB = std::bind(&Test::Update, this);
(invoked as test.UpdateB()
), its overall performance is considerably slower than calling the function directly as test.Update()
The performance decrease also seems to affect the work done inside the method.
Using the site quick-bench, I ran the snippet below and got the following result:
#include <cstdint>   // for uint64_t
#include <functional>
#include <benchmark/benchmark.h>

typedef unsigned u32;
typedef uint64_t u64;

constexpr auto nP = nullptr;
constexpr bool _F = false;
constexpr bool _T = true;

constexpr u64 HIGH_LOAD = 1000000000;
constexpr u64 LOW_LOAD = 10;

struct Test {
    u32 counter{100000};
    u64 soak{0};
    u64 load{10};
    bool isAlive{_T};
    std::function<bool()> UpdateB;

    Test() { UpdateB = std::bind(&Test::Update, this); }

    bool Update() {
        if (counter > 0) { counter--; }
        u64 i = load;
        while (i--) { soak += 1; }
        isAlive = counter > 0;
        return isAlive;
    }
};
static void DirectCallLowLoad(benchmark::State& state) {
    Test test;
    test.load = LOW_LOAD;
    for (auto _ : state) { test.Update(); }
}
BENCHMARK(DirectCallLowLoad);

static void DirectCallHighLoad(benchmark::State& state) {
    Test test;
    test.load = HIGH_LOAD;
    for (auto _ : state) { test.Update(); }
}
BENCHMARK(DirectCallHighLoad);

static void BoundCallLowLoad(benchmark::State& state) {
    Test test;
    test.load = LOW_LOAD;
    for (auto _ : state) { test.UpdateB(); }
}
BENCHMARK(BoundCallLowLoad);

static void BoundCallHighLoad(benchmark::State& state) {
    Test test;
    test.load = HIGH_LOAD;
    for (auto _ : state) { test.UpdateB(); }
}
BENCHMARK(BoundCallHighLoad);
The expectation is that...
- BoundCallHighLoad performance would get closer to DirectCallHighLoad, as the call overhead has less effect compared to the method's load.
- DirectCallLowLoad performance would be significantly better than DirectCallHighLoad (and the same for the bound calls).
- Bound calls would not be almost 5 times slower than direct calls.
What is wrong with my code?
Why are bound calls so slow?
If I instead use
std::function<bool(Test*)> UpdateB;
Test() { UpdateB = &Test::Update; } // Test constructor
// call using
test.UpdateB(&test);
it gets even worse: the call test.UpdateB(&test)
is orders of magnitude slower than the direct call test.Update()
, with the processing load making little to no difference.
Upvotes: 6
Views: 254
Reputation: 50358
First of all, there is a "Show Noop bar" option to see the overhead of a no-op instruction (theoretically 1 cycle). Here is the result:
Thus, one can clearly see that DirectCallLowLoad and DirectCallHighLoad are optimized out and the benchmark is biased. Indeed, it is impossible for a CPU to execute 1000000000 iterations in about 2 cycles; in fact, it is not even possible to do 10 iterations in that time. The same applies to the other two benchmarks, though they carry an additional overhead.
The reason this code is optimized is that Clang is able to see that repeating soak += 1;
load
times is equivalent to adding load
to soak
. In fact, it can do more advanced mathematical reductions, like 1+2+3+...+N = N*(N+1)/2
. It is not easy to trick a compiler, but one solution is to compute things that are provably hard to simplify, like the hailstone (Collatz) sequence: if a compiler were able to optimize that away, it would effectively have proven the Collatz conjecture, which is still open. Note that it is better if the compiler cannot know the initial value, which should be the purpose of the QuickBench loop state. Note also that any random number generator could potentially do the job.
UpdateB
is slower than Update
because of the additional runtime indirection. While compilers could theoretically optimize such code through specialization, they often do not by default because it would be too expensive. In fact, in this specific benchmark, they can hardly do so without help, since the QuickBench loop state should prevent the compiler from optimizing it. One solution to reduce this overhead is profile-guided optimization, so the compiler can make speculative optimizations based on the function actually installed in practice. That being said, it should still not be enough in this QuickBench case due to the loop state.
Note that you can see the assembly code directly in QuickBench and observe, for example, that the code is inlined for the first two cases. Here is the main loop of the first benchmark:
mov %edi,0x8(%rsp)
3.26% add $0xfffffffffffffffc,%r12
je 2131ca <DirectCallLowLoad(benchmark::State&)+0x13a>
mov %eax,%esi
0.50% mov %eax,%edi
5.56% sub $0x1,%edi
4.10% cmovb %r8d,%edi
mov %edi,%eax
0.44% sub $0x1,%eax
9.14% cmovae %eax,%edi
8.73% setb %dl
2.70% cmovb %r8d,%eax
8.17% sub $0x1,%eax
7.46% cmovae %eax,%edi
10.13% cmovb %r8d,%eax
1.43% setb %cl
7.89% sub $0x1,%eax
1.80% cmovae %eax,%edi
9.42% setb %bl
2.30% cmovb %r8d,%eax
8.42% cmp $0x1,%esi
jae 213220 <DirectCallLowLoad(benchmark::State&)+0x190>
test %dl,%dl
je 213220 <DirectCallLowLoad(benchmark::State&)+0x190>
1.24% test %cl,%cl
je 213220 <DirectCallLowLoad(benchmark::State&)+0x190>
7.27% test %bl,%bl
0.03% jne 213224 <DirectCallLowLoad(benchmark::State&)+0x194>
jmp 213220 <DirectCallLowLoad(benchmark::State&)+0x190>
Here is the main loop of the third benchmark (not inlined, with an indirection):
99.73% cmpq $0x0,0x38(%rsp)
je 213681 <BoundCallLowLoad(benchmark::State&)+0xd1>
mov %r15,%rdi
call *0x40(%rsp)
0.27% add $0xffffffffffffffff,%rbx
jne 213640 <BoundCallLowLoad(benchmark::State&)+0x90>
One can see that the compiler generates code that first checks whether the std::function
target (loaded from the stack at 0x38(%rsp)
) is valid (i.e. not a null function pointer) and then calls it, before decreasing the iteration counter (rbx
) on each iteration of the benchmarking loop.
As for the inner-loop optimization, the code of Test::Update
is not directly visible in QuickBench, but you can see it on Godbolt:
Test::Update(): # @Test::Update()
mov eax, dword ptr [rdi]
test eax, eax
je .LBB5_1
add eax, -1
mov dword ptr [rdi], eax
mov rcx, qword ptr [rdi + 16]
test rcx, rcx
je .LBB5_5
.LBB5_4:
add qword ptr [rdi + 8], rcx
.LBB5_5:
test eax, eax
setne al
setne byte ptr [rdi + 24]
ret
.LBB5_1:
xor eax, eax
mov rcx, qword ptr [rdi + 16]
test rcx, rcx
jne .LBB5_4
jmp .LBB5_5
The key point of this code is the instruction add qword ptr [rdi + 8], rcx
, which is basically equivalent to soak += load;
in C++.
Upvotes: 5