curiouscupcake

Reputation: 1277

Very low FLOPs/second without any data transfer

I tested the following code on my machine to see how much throughput I can get. The code does not do much besides assigning each thread a pair of nested loops:

#include <chrono>
#include <iostream>


int main() {
    auto start_time = std::chrono::high_resolution_clock::now();
#pragma omp parallel for
    for(int thread = 0; thread < 24; thread++) {
        float i = 0.0f;
        while(i < 100000.0f) {
            float j = 0.0f;
            while (j < 100000.0f) {
                j = j + 1.0f;
            }
            i = i + 1.0f;
        }
    }
    auto end_time = std::chrono::high_resolution_clock::now();
    auto time = end_time - start_time;
    std::cout << time / std::chrono::milliseconds(1) << std::endl;
    return 0;
}

To my surprise, the throughput is very low according to perf

$ perf stat -e all_dc_accesses -e fp_ret_sse_avx_ops.all cmake-build-release/roofline_prediction
8907

 Performance counter stats for 'cmake-build-release/roofline_prediction':

       325.372.690      all_dc_accesses                                             
   240.002.400.000      fp_ret_sse_avx_ops.all                                      

       8,909514307 seconds time elapsed

     202,819795000 seconds user
       0,059613000 seconds sys

With 240.002.400.000 floating-point operations in roughly 8.9 seconds, the machine achieved only about 27 GFLOPs/second, far below the CPU's capacity of 392 GFLOPs/sec (a figure I got from a roofline modelling tool).
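For what it's worth, that figure follows directly from the two numbers above; here is a quick check using the counter value from perf and the 8907 ms the program printed itself:

#include <iostream>

int main() {
    const double flops   = 240002400000.0;  // fp_ret_sse_avx_ops.all from perf
    const double seconds = 8.907;           // 8907 ms reported by the program
    std::cout << flops / seconds * 1e-9 << " GFLOP/s" << std::endl;  // prints about 26.9
    return 0;
}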

My question is: how can I achieve higher throughput?

Upvotes: 2

Views: 179

Answers (1)

user555045

Reputation: 64913

Compiled with GCC 9.3 with optimizations enabled, the inner loop looks like this:

.L3:
        addss   xmm0, xmm2
        comiss  xmm1, xmm0
        ja      .L3

Some other combinations of GCC version and options may result in the loop being elided entirely; after all, it doesn't really do anything (except waste time).

The addss forms a loop-carried dependency chain with only itself in it. That is not fast: on Zen 1 it takes 3 cycles per iteration, so the number of additions per cycle is 1/3. The maximum number of floating-point additions per cycle could be attained by having at least 6 independent addps instructions in flight, to cover the latency of 3 cycles at a throughput of 2 addps per cycle (256-bit vaddps may help a bit, but Zen 1 executes such 256-bit SIMD instructions as 2 128-bit operations internally). Two addps per cycle, at 4 floats each, corresponds to 8 additions per cycle, 24 times as much as the current code.

From a C++ program, it may be possible to coax the compiler into generating suitable machine code by:

  • Using -ffast-math (if possible, which it isn't always)
  • Using explicit vectorization with _mm_add_ps
  • Manually unrolling the loop, using (at least 6) independent accumulators (see the sketch after this list)
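
For instance, combining the last two points, a minimal sketch (not tuned or tested on Zen 1 specifically) of the benchmark rewritten with _mm_add_ps and six independent accumulators per thread could look like the following; the iteration count is chosen so that each thread still performs roughly 100000 × 100000 additions:

#include <immintrin.h>
#include <chrono>
#include <iostream>

int main() {
    auto start_time = std::chrono::high_resolution_clock::now();
#pragma omp parallel for
    for (int thread = 0; thread < 24; thread++) {
        // Six independent 128-bit accumulators: enough chains in flight to
        // cover the 3-cycle addps latency at a throughput of 2 per cycle.
        __m128 acc0 = _mm_setzero_ps(), acc1 = _mm_setzero_ps(),
               acc2 = _mm_setzero_ps(), acc3 = _mm_setzero_ps(),
               acc4 = _mm_setzero_ps(), acc5 = _mm_setzero_ps();
        const __m128 inc = _mm_set1_ps(1.0f);
        // Each iteration performs 6 x 4 = 24 scalar float additions.
        for (long long i = 0; i < 100000LL * 100000LL / 24; ++i) {
            acc0 = _mm_add_ps(acc0, inc);
            acc1 = _mm_add_ps(acc1, inc);
            acc2 = _mm_add_ps(acc2, inc);
            acc3 = _mm_add_ps(acc3, inc);
            acc4 = _mm_add_ps(acc4, inc);
            acc5 = _mm_add_ps(acc5, inc);
        }
        // Fold the accumulators into a value the compiler must keep, so the
        // additions cannot be optimized away as dead code.
        __m128 sum = _mm_add_ps(_mm_add_ps(acc0, acc1),
                                _mm_add_ps(_mm_add_ps(acc2, acc3),
                                           _mm_add_ps(acc4, acc5)));
        volatile float sink = _mm_cvtss_f32(sum);
        (void)sink;
    }
    auto end_time = std::chrono::high_resolution_clock::now();
    std::cout << (end_time - start_time) / std::chrono::milliseconds(1) << std::endl;
    return 0;
}

The exact layout matters less than having several independent accumulators: with only one, the latency of each addition serializes the whole loop, which is exactly what the original code runs into.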

Upvotes: 2
