Maciej Donajski
Maciej Donajski

Reputation: 298

c++ optimization of 2 lines of critical code

With valgrind and perf/FlameGraphs I have identified part of my application which is consuming almost 100% of CPU:

for(size_t i = 0; i < objects.size(); i++) {

  //this part consumes 11% CPU -----> 
  collions_count = database->get_collisions(collisions_block, objects[i].getKey());
  feature1 = objects[i].feature1;
  //<--------

  for(int j = 0; j < collions_count * 2; j += 2) {

    hash = 
      ((collisions_block[j] & config::MASK_1) << config::SHIFT) | 
      ((collisions_block[j+1] - feature1) & config::MASK_2);

    if (++offsets[hash] >= config::THRESHOLD_1) {

      //... this part consumes < 1% of CPU

    }
  }
}

The calculation of hash and following if statement take nearly 90% of CPU of all application.

Is there any way to speed up the inner loop?

Upvotes: 4

Views: 259

Answers (1)

Maciej Donajski
Maciej Donajski

Reputation: 298

I identified the performance bottleneck to be the unordered access to array ++offsets[hash]. It was consuming most of the CPU time (75+%). I achieved 2.5x speed increase by reducing the size of the array from 1<<24 to 1<<21 and experimenting with appropriate MASKS configuration.

I will describe briefly how I identified the problem

for(size_t i = 0; i < objects.size(); i++) {

  //this part consumes 11% CPU -----> 
  collions_count = database->get_collisions(collisions_block, objects[i].getKey());
  feature1 = objects[i].feature1;
  //<--------

  for(int j = 0; j < collions_count * 2; j += 2) {

    hash = calculate_hash(collisions_block[j], 
      collisions_block[j+1],
      feature1,
      config::MASK_1,
      config::MASK_2
      config::SHIFT);

    if (check_condition(hash, config::THRESHOLD_1)) {

       //... this part consumes < 1% of CPU

    }
  }
}
  1. Split the critical 2 lines to separate functions for better profiling (be careful to put __attribute__((noinline)) to prevent gcc from inlining new functions. They will not appear on call stack if inlined)
  2. Compile the code with -g -rdynamic gcc flags
  3. Run sampling pefrormance tool perf record -p <pid> -F 200 -g --call-graph dwarf -- sleep 60
  4. Convert to FlameGraph for better readability perf script | ./stackcollapse-perf.pl > out.perf-folded && ./flamegraph.pl out.perf-folded > graph.svg
  5. Identify most expensive function from flamegraph and optimize it

Upvotes: 1

Related Questions