Dystackful

Reputation: 3

Throughput of trivially concurrent code does not increase with the number of threads

I am trying to use OpenMP to benchmark the speed of a data structure that I implemented. However, I seem to be making a fundamental mistake: the throughput decreases instead of increasing with the number of threads, no matter which operation I try to benchmark. Below you can see code that tries to benchmark the speed of a for loop, so I would expect it to scale (somewhat) linearly with the number of threads; it doesn't (compiled with g++ and C++11 on a dual-core laptop, both with and without the -O3 flag).

#include <omp.h>
#include <atomic>
#include <chrono>
#include <iostream>

thread_local const int OPS = 10000;
thread_local const int TIMES = 200;

double get_tp(int THREADS)
{
    double threadtime[THREADS] = {0};

    //Repeat the test many times
    for(int iteration = 0; iteration < TIMES; iteration++)
    {
        #pragma  omp  parallel num_threads(THREADS)
        {
            double start, stop;
            int loc_ops = OPS/float(THREADS);
            int t = omp_get_thread_num();

            //Force all threads to start at the same time
            #pragma  omp  barrier
            start = omp_get_wtime();


            //Do a certain kind of operations loc_ops times
            for(int i = 0; i < loc_ops; i++)
            {
                //Here I would put the operations to benchmark
                //in this case a boring for loop
                int x = 0;
                for(int j = 0; j < 1000; j++)
                    x++;
            }

            stop = omp_get_wtime();
            threadtime[t] += stop-start;
        }
    }

    double total_time = 0;
    std::cout << "\nThread times: ";
    for(int i = 0; i < THREADS; i++)
    {
        total_time += threadtime[i];
        std::cout << threadtime[i] << ", ";
    }
    std::cout << "\nTotal time: " << total_time << "\n";
    double mopss = float(OPS)*TIMES/total_time;
    return mopss;
}

int main()
{
    std::cout << "\n1  " << get_tp(1) << "ops/s\n";
    std::cout << "\n2  " << get_tp(2) << "ops/s\n";
    std::cout << "\n4  " << get_tp(4) << "ops/s\n";
    std::cout << "\n8  " << get_tp(8) << "ops/s\n";
}

Output with -O3 on a dual-core machine. We don't expect the throughput to increase beyond 2 threads, but it does not even increase when going from 1 to 2 threads; instead it drops by 50%:

1 Thread 
Thread times: 7.411e-06, 
Total time: 7.411e-06
2.69869e+11 ops/s

2 Threads 
Thread times: 7.36701e-06, 7.38301e-06, 
Total time: 1.475e-05
1.35593e+11ops/s

4 Threads 
Thread times: 7.44301e-06, 8.31901e-06, 8.34001e-06, 7.498e-06, 
Total time: 3.16e-05
6.32911e+10ops/s

8 Threads 
Thread times: 7.885e-06, 8.18899e-06, 9.001e-06, 7.838e-06, 7.75799e-06, 7.783e-06, 8.349e-06, 8.855e-06, 
Total time: 6.5658e-05
3.04609e+10ops/s

To make sure that the compiler does not remove the loop, I also tried outputting "x" after measuring the time, and to the best of my knowledge the problem persists. I also tried the code on a machine with more cores, and it behaved very similarly. Without -O3 the throughput does not scale either, so there is clearly something wrong with the way I benchmark. I hope you can help me.

Upvotes: 0

Views: 95

Answers (2)

Hristo Iliev

Reputation: 74415

I'm not sure why you define performance as the total number of operations per total CPU time and are then surprised that it is a decreasing function of the number of threads. That will almost always be the case, except when cache effects kick in. The true performance metric is the number of operations per unit of wall-clock time.
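To make the difference concrete, here is a minimal, hypothetical sketch of measuring against wall-clock time; get_tp_wall, the dummy increment loop, and the constants are placeholders of mine, not your original benchmark:

#include <omp.h>
#include <iostream>

volatile long sink = 0;   // written to so the benchmarked work stays observable

// Hypothetical alternative to get_tp(): operations per second of elapsed
// (wall-clock) time for the whole parallel region.
double get_tp_wall(int threads, long ops)
{
    long total = 0;

    double start = omp_get_wtime();            // wall clock, read once outside the region
    #pragma omp parallel num_threads(threads) reduction(+:total)
    {
        long loc_ops = ops / threads;          // split the work as in your code
        long x = 0;
        for (long i = 0; i < loc_ops; i++)     // stand-in for the operation under test;
            x++;                               // the real data-structure operation goes here
        total += x;
    }
    double stop = omp_get_wtime();

    sink = total;                              // consume the result
    return double(ops) / (stop - start);       // ops per second of wall-clock time
}

int main()
{
    const long OPS = 400000000L;               // enough work to run well above the timer resolution
    int thread_counts[] = {1, 2, 4, 8};
    for (int t : thread_counts)
        std::cout << t << " threads: " << get_tp_wall(t, OPS) << " ops/s\n";
}

Measured this way, the timer includes the thread-launch overhead, and the elapsed time for a fixed amount of work should shrink (though not perfectly linearly) as threads are added, which is the speed-up you actually care about.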

Why the number of operations per total CPU time decreases with the thread count is easy to show with simple mathematical reasoning. Given a total amount of work W and a processing capability P per core, the time on a single core is T_1 = W / P. Dividing the work evenly among n cores means each of them works for T_1,n = (W / n + H) / P, where H is the per-thread overhead induced by the parallelisation itself. The sum of those times is T_n = n * T_1,n = W / P + n * (H / P) = T_1 + n * (H / P).

The overhead is always positive, even in the trivial case of so-called embarrassing parallelism where no two threads need to communicate or synchronise. For example, launching the OpenMP threads takes time. You cannot get rid of the overhead; you can only amortise it over the lifetime of the threads by making sure that each one gets a lot to work on. Therefore, T_n > T_1, and with a fixed number of operations in both cases the performance on n cores will always be lower than on a single core. The only exception to this rule is the case when the data for work of size W doesn't fit in the lower-level caches while that for work of size W / n does. This results in a massive speed-up that exceeds the number of cores, known as superlinear speed-up. Since you are measuring inside the thread function, you ignore the value of H, and T_n should more or less equal T_1 within the timer precision, but...

With multiple threads running on multiple CPU cores, they all compete for limited shared resources, namely the last-level cache (if any), the memory bandwidth, and the thermal envelope.

Memory bandwidth is not a problem when you are simply incrementing a scalar variable, but it becomes the bottleneck once the code starts actually moving data in and out of the CPU. A canonical example from numerical computing is sparse matrix-vector multiplication (spMVM) -- a properly optimised spMVM routine working with double non-zero values and long indices eats so much memory bandwidth that one can completely saturate the memory bus with as few as two threads per CPU socket, making an expensive 64-core CPU a very poor choice in that case. This is true for all algorithms with low arithmetic intensity (operations per unit of data volume).
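As a rough, hypothetical illustration of low arithmetic intensity (a simple triad loop rather than an spMVM), each iteration below does one multiply and one add but moves about 24 bytes, so once the arrays are much larger than the last-level cache the parallel loop is limited by the memory bus rather than by the number of cores:

#include <omp.h>
#include <cstddef>
#include <iostream>
#include <vector>

int main()
{
    // Three arrays of about 128 MiB each, far larger than any last-level cache.
    const std::size_t N = 1 << 24;
    std::vector<double> a(N), b(N, 1.0), c(N, 2.0);
    const double s = 3.0;

    double start = omp_get_wtime();
    #pragma omp parallel for
    for (long i = 0; i < (long)N; i++)
        a[i] = b[i] + s * c[i];        // 2 flops versus ~24 bytes of memory traffic per iteration
    double elapsed = omp_get_wtime() - start;

    // Once the memory bus is saturated, adding more threads barely changes 'elapsed'.
    std::cout << a[N / 2] << "  " << elapsed << " s\n";
}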

When it comes to the thermal envelope, most modern CPUs employ dynamic power management and will clock the cores up or down depending on how many of them are active. Therefore, while n clocked-down cores perform more work in total per unit of time than a single core, a single core outperforms n cores in terms of work per total CPU time, which is the metric you are using.

With all this in mind, there is one last (but not least) thing to consider -- timer resolution and measurement noise. Your run times are on the order of a few microseconds. Unless your code runs on some specialised hardware that does nothing else but run your code (i.e., no time sharing with daemons, kernel threads, and other processes, and no interrupt handling), you need benchmarks that run several orders of magnitude longer, preferably for at least a couple of seconds.
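One common way to deal with both the resolution and the noise, sketched below with a placeholder work loop and made-up sizes, is to make one repetition take a sizeable fraction of a second and then repeat the timed run a few times, keeping the shortest wall-clock time, since the minimum is the measurement least disturbed by other activity on the machine:

#include <omp.h>
#include <algorithm>
#include <iostream>

volatile long sink = 0;                     // volatile accesses keep the loop from being folded away

// Placeholder for one timed repetition of the real benchmark body.
static double run_once()
{
    double start = omp_get_wtime();
    for (long i = 0; i < 200000000L; i++)   // sized so one repetition takes a noticeable time
        sink = sink + 1;
    return omp_get_wtime() - start;         // wall-clock duration of this repetition
}

int main()
{
    const int REPS = 5;
    double best = run_once();               // the first run also serves as a warm-up
    for (int r = 1; r < REPS; r++)
        best = std::min(best, run_once());  // keep the least-disturbed run
    std::cout << "best of " << REPS << ": " << best << " s\n";
}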

Upvotes: 1

cdhowie

Reputation: 169143

The loop is almost certainly still getting optimized out, even if you output the value of x after the outer loop. The compiler can trivially replace the entire loop with a single instruction, since the loop bounds are constant at compile time. Indeed, in this example:

#include <iostream>

int main()
{
    int x = 0;
    for (int i = 0; i < 10000; ++i) {
        for (int j = 0; j < 1000; ++j) {
            ++x;
        }
    }

    std::cout << x << '\n';
    return 0;
}

The loop is replaced with the single assembly instruction mov esi, 10000000.

Always inspect the assembly output when benchmarking to make sure that you're measuring what you think you are; in this case you are just measuring the overhead of creating threads, which of course will be higher the more threads you create.

Consider having the innermost loop do something that can't be optimized away. Random number generation is a good candidate because it should perform in constant time, and it has the side-effect of permuting the PRNG state (making it ineligible to be removed entirely, unless the seed is known in advance and the compiler is able to unravel all of the mutation in the PRNG).

For example:

#include <iostream>
#include <random>

int main()
{
    std::mt19937 r;
    std::uniform_real_distribution<double> dist{0, 1};

    for (int i = 0; i < 10000; ++i) {
        for (int j = 0; j < 1000; ++j) {
            dist(r);
        }
    }

    return 0;
}

Both loops and the PRNG invocation are left intact here.
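If you want to combine this with the OpenMP harness from the question, one possible sketch (the names, seeds, and constants here are mine, not something from the question) is to give every thread its own engine and to consume the results so that nothing can be discarded:

#include <omp.h>
#include <iostream>
#include <random>

int main()
{
    const int THREADS = 4;                 // thread count under test
    const long LOC_OPS = 10000000L;        // per-thread work, enough to run for a noticeable time
    double checksum = 0;                   // printed below so the work cannot be discarded

    double start = omp_get_wtime();
    #pragma omp parallel num_threads(THREADS) reduction(+:checksum)
    {
        // One engine per thread: no shared PRNG state, hence no races between threads.
        std::mt19937 rng(omp_get_thread_num() + 1);
        std::uniform_real_distribution<double> dist{0, 1};

        double local = 0;
        for (long i = 0; i < LOC_OPS; i++)
            local += dist(rng);            // the PRNG call stands in for the operation to benchmark
        checksum += local;
    }
    double elapsed = omp_get_wtime() - start;

    std::cout << "checksum " << checksum << "\n"
              << "throughput " << double(THREADS) * LOC_OPS / elapsed
              << " ops/s (wall clock)\n";
}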

Upvotes: 0
