Blindman67

Reputation: 54041

Using functional Library, slow performance of member function call

When calling the bound method this->UpdateB = std::bind(&Test::Update, this); (invoked as test.UpdateB()), its overall performance is considerably slower than calling the function directly as test.Update().

The performance decrease also seems to affect the work done inside the method.

Using the site quick-bench, I ran the snippet below and got the following result:

[quick-bench results screenshot]

#include <functional>
#include <benchmark/benchmark.h>


typedef unsigned u32;    
typedef uint64_t u64;       
constexpr auto nP = nullptr;    
constexpr bool _F = false;      
constexpr bool _T = true;       

constexpr u64 HIGH_LOAD = 1000000000;
constexpr u64 LOW_LOAD = 10;

struct Test {
    u32 counter{100000};
    u64 soak{0};
    u64 load{10};
    bool isAlive{_T};
    std::function<bool()> UpdateB;
    Test() { UpdateB = std::bind( &Test::Update, this); }
    bool Update() {
        if (counter > 0) { counter --; }
        u64 i = load;
        while(i--) { soak += 1; }
        isAlive = counter > 0;
        return isAlive;
    }    
};

static void DirectCallLowLoad(benchmark::State& state) {
  Test test;  
  test.load = LOW_LOAD;
  for (auto _ : state) { test.Update(); }
}
BENCHMARK(DirectCallLowLoad);

static void DirectCallHighLoad(benchmark::State& state) {
  Test test;  
  test.load = HIGH_LOAD;
  for (auto _ : state) { test.Update(); }
}
BENCHMARK(DirectCallHighLoad);

static void BoundCallLowLoad(benchmark::State& state) {
  Test test;   
  test.load = LOW_LOAD;
  for (auto _ : state) { test.UpdateB();  }
}
BENCHMARK(BoundCallLowLoad);

static void BoundCallHighLoad(benchmark::State& state) {
  Test test;   
  test.load = HIGH_LOAD;
  for (auto _ : state) { test.UpdateB(); }
}
BENCHMARK(BoundCallHighLoad);


The expectation is that...

What is wrong with my code?

Why are bound calls so slow?

If I use

    std::function<bool(Test*)> UpdateB;
    Test() { UpdateB = &Test::Update;  }  // Test constructor
     
    // call using
    test.UpdateB(&test);


it gets even worse: the call test.UpdateB(&test); is orders of magnitude slower than the direct call test.Update(), with the processing load making little to no difference.

Upvotes: 6

Views: 254

Answers (1)

Jérôme Richard

Reputation: 50358

First of all, there is a "Show Noop bar" option to see the overhead of a no-op instruction (theoretically 1 cycle). Here is the result:

[quick-bench results screenshot with the Noop bar shown]

Thus, one can clearly see that DirectCallLowLoad and DirectCallHighLoad are optimized out and the benchmark is biased. Indeed, it is impossible for a CPU to execute 1000000000 iterations in about 2 cycles; in fact, it is not even possible to do 10 iterations in that time. The same applies to the other two benchmarks, though they carry an additional overhead.

The reason this code is optimized is that Clang can see that repeating soak += 1; load times is equivalent to adding load to soak. In fact, it can do even more advanced mathematical rewrites, like turning 1+2+3+...+N into N*(N+1)/2. It is not easy to trick a compiler, but one solution is to compute things that are mathematically hard to reason about, such as the hailstone (Collatz) sequence: if a compiler could optimize that away, it would effectively have to prove the Collatz conjecture, which is still open. It also helps if the compiler cannot know the initial value, which is the purpose of the QuickBench loop state. Note that any random number generator could potentially do the job as well.
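As a sketch (not code from the question), a load loop built on the hailstone step resists this kind of constant folding, because each iteration depends non-trivially on the previous value:

```cpp
#include <cstdint>

// Hypothetical replacement for the `while(i--) { soak += 1; }` load loop.
// Each step is n -> n/2 (even) or 3n+1 (odd), so the compiler cannot
// collapse the whole loop into a single addition the way it can with
// `soak += 1;` repeated `load` times.
uint64_t hailstone_soak(uint64_t seed, uint64_t steps) {
    uint64_t n = seed;
    uint64_t soak = 0;
    while (steps--) {
        n = (n % 2 == 0) ? n / 2 : 3 * n + 1;  // hailstone (Collatz) step
        soak += n;
    }
    return soak;
}
```

Seeding it from something the optimizer cannot see (e.g. the benchmark state or a random generator) makes the folding even less likely.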

UpdateB is slower than Update because of the additional runtime indirection. While compilers could theoretically optimize such code through specialization, they usually do not by default because it would be too expensive. In fact, in this specific benchmark they can hardly do so without help, since the QuickBench loop state prevents the compiler from optimizing it. One solution to reduce this overhead is profile-guided optimization, so the compiler can make speculative optimizations based on the function actually installed in practice. That being said, it should still not be enough in this QuickBench case because of the loop state.

Note that you can see the assembly code directly in QuickBench and observe, for example, that the code is inlined for the first two cases. Here is the main loop of the first benchmark:

       mov    %edi,0x8(%rsp)
3.26%  add    $0xfffffffffffffffc,%r12
       je     2131ca <DirectCallLowLoad(benchmark::State&)+0x13a>
       mov    %eax,%esi
0.50%  mov    %eax,%edi
5.56%  sub    $0x1,%edi
4.10%  cmovb  %r8d,%edi
       mov    %edi,%eax
0.44%  sub    $0x1,%eax
9.14%  cmovae %eax,%edi
8.73%  setb   %dl
2.70%  cmovb  %r8d,%eax
8.17%  sub    $0x1,%eax
7.46%  cmovae %eax,%edi
10.13% cmovb  %r8d,%eax
1.43%  setb   %cl
7.89%  sub    $0x1,%eax
1.80%  cmovae %eax,%edi
9.42%  setb   %bl
2.30%  cmovb  %r8d,%eax
8.42%  cmp    $0x1,%esi
       jae    213220 <DirectCallLowLoad(benchmark::State&)+0x190>
       test   %dl,%dl
       je     213220 <DirectCallLowLoad(benchmark::State&)+0x190>
1.24%  test   %cl,%cl
       je     213220 <DirectCallLowLoad(benchmark::State&)+0x190>
7.27%  test   %bl,%bl
0.03%  jne    213224 <DirectCallLowLoad(benchmark::State&)+0x194>
       jmp    213220 <DirectCallLowLoad(benchmark::State&)+0x190>

Here is the one of the third benchmark (without inlining and with an indirection):

99.73% cmpq   $0x0,0x38(%rsp)
       je     213681 <BoundCallLowLoad(benchmark::State&)+0xd1>
       mov    %r15,%rdi
       call   *0x40(%rsp)
0.27%  add    $0xffffffffffffffff,%rbx
       jne    213640 <BoundCallLowLoad(benchmark::State&)+0x90>

One can see that the compiler generates code that first checks whether the std::function object (stored on the stack, at 0x38(%rsp)) holds a valid target (i.e. a non-null function pointer) and then calls it through an indirect call, before decrementing the iteration counter (rbx) on each iteration of the benchmarking loop.
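That validity check corresponds to the behaviour std::function is required to have: calling an empty std::function must throw std::bad_function_call, so every call site carries a test before the indirect call. A small illustration:

```cpp
#include <functional>
#include <stdexcept>

// Calling an empty std::function throws std::bad_function_call; the
// compiler-emitted null check in the assembly implements exactly this.
// `call_checked` is an illustrative helper, not part of the benchmark.
bool call_checked(const std::function<bool()>& f) {
    try {
        return f();                       // validity check, then indirect call
    } catch (const std::bad_function_call&) {
        return false;                     // empty function object
    }
}
```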

As for the inner-loop optimization, the code of Test::Update is not directly visible in QuickBench, but you can see it on Godbolt:

Test::Update():                      # @Test::Update()
        mov     eax, dword ptr [rdi]
        test    eax, eax
        je      .LBB5_1
        add     eax, -1
        mov     dword ptr [rdi], eax
        mov     rcx, qword ptr [rdi + 16]
        test    rcx, rcx
        je      .LBB5_5
.LBB5_4:
        add     qword ptr [rdi + 8], rcx
.LBB5_5:
        test    eax, eax
        setne   al
        setne   byte ptr [rdi + 24]
        ret
.LBB5_1:
        xor     eax, eax
        mov     rcx, qword ptr [rdi + 16]
        test    rcx, rcx
        jne     .LBB5_4
        jmp     .LBB5_5

The key point is the instruction add qword ptr [rdi + 8], rcx, which is basically equivalent to soak += load; in C++.
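In source form, the rewrite the compiler performs amounts to the following (a sketch for illustration; both functions compute the same value, and Clang emits the folded form for the loop):

```cpp
#include <cstdint>

// Source form of the inner loop from Test::Update.
uint64_t soak_loop(uint64_t soak, uint64_t load) {
    uint64_t i = load;
    while (i--) soak += 1;    // add 1, `load` times
    return soak;
}

// The form the optimizer reduces it to: a single addition,
// matching the `add qword ptr [rdi + 8], rcx` seen in the assembly.
uint64_t soak_folded(uint64_t soak, uint64_t load) {
    return soak + load;
}
```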

Upvotes: 5
