S. Poell

Reputation: 1

QtConcurrent gives longer runtimes for multiple cores

I have designed an algorithm and now I'm working on an implementation to solve it on multiple cores. Essentially I'm giving each core the same problem and I'll choose the solution with the best score. However, I'm noticing that using multiple cores slows down the runtime of my code, but I don't understand why. So I created a very simple example that shows the same behaviour. I have a simple Algorithm class:

algorithm.h

    class Algorithm
    {
    public:
        Algorithm() : mDummy(0) {}
        void runAlgorithm();

    protected:
        long mDummy;
    };

algorithm.cpp

    #include "algorithm.h"

    void Algorithm::runAlgorithm()
    {
        long long k = 0;
        for (long long i = 0; i < 200000; ++i)
        {
            for (long long j = 0; j < 200000; ++j)
            {
                k = k + i - j;
            }
        }
        mDummy = k;
    }

main.cpp

    #include "algorithm.h"
    #include <QtCore/QCoreApplication>
    #include <QtConcurrent/QtConcurrent>

    #include <vector>
    #include <fstream>
    #include <QFuture>
    #include <QDateTime>
    #include <memory>

    int main(int argc, char *argv[])
    {
        QCoreApplication a(argc, argv);
        std::ofstream logFile;
        logFile.open("AlgorithmLog.log", std::ios::trunc | std::ios::out);
        if (!logFile.is_open())
        {
            return 1;
        }

        for (int i = 1; i < 8; i++)
        {
            int cores = i;
            logFile << "Start: cores = " << cores << "   " << QDateTime::currentDateTime().toString(Qt::ISODate).toLatin1().data() << "\n";

            std::vector<std::unique_ptr<Algorithm>> cvAlgorithmRuns;
            for (int j = 0; j < cores; ++j)
                cvAlgorithmRuns.push_back(std::unique_ptr<Algorithm>(new Algorithm()));

            QFuture<void> asyncCalls = QtConcurrent::map(cvAlgorithmRuns, [](std::unique_ptr<Algorithm>& x) { x->runAlgorithm(); });
            asyncCalls.waitForFinished();

            logFile << "End: " << QDateTime::currentDateTime().toString(Qt::ISODate).toLatin1().data() << "\n";
            logFile.flush();
        }
        logFile.close();
        return a.exec();
    }

When I run this on my laptop (I'm using VS2015, x64, Qt 5.9.0, 8 logical processors) I get:

Start: cores = 1   2018-06-28T10:48:30 End: 2018-06-28T10:48:44
Start: cores = 2   2018-06-28T10:48:44 End: 2018-06-28T10:48:58
Start: cores = 3   2018-06-28T10:48:58 End: 2018-06-28T10:49:13
Start: cores = 4   2018-06-28T10:49:13 End: 2018-06-28T10:49:28
Start: cores = 5   2018-06-28T10:49:28 End: 2018-06-28T10:49:43
Start: cores = 6   2018-06-28T10:49:43 End: 2018-06-28T10:49:58
Start: cores = 7   2018-06-28T10:49:58 End: 2018-06-28T10:50:13

Which makes sense: the same runtime (between 14 and 15 seconds) for all steps, whether I'm using 1 core or 7 cores.

But when I change the line in algorithm.h from:

protected:
    long mDummy;

to:

protected:
    double mDummy;

I get these results:

Start: cores = 1   2018-06-28T10:52:30 End: 2018-06-28T10:52:44
Start: cores = 2   2018-06-28T10:52:44 End: 2018-06-28T10:52:59
Start: cores = 3   2018-06-28T10:52:59 End: 2018-06-28T10:53:15
Start: cores = 4   2018-06-28T10:53:15 End: 2018-06-28T10:53:32
Start: cores = 5   2018-06-28T10:53:32 End: 2018-06-28T10:53:53
Start: cores = 6   2018-06-28T10:53:53 End: 2018-06-28T10:54:14
Start: cores = 7   2018-06-28T10:54:14 End: 2018-06-28T10:54:38

Here I start with 14 seconds runtime for 1 core, but the runtime increases to 24 seconds using 7 cores.

Can anybody explain why in the second run the runtime increases when using multiple cores?

Upvotes: 0

Views: 230

Answers (1)

Sergio Monteleone

Reputation: 2886

I believe the problem lies in the actual number of FPUs you have, as suggested by @Aconcagua. "Logical processors" (aka hyper-threading) are not the same as having twice as many cores.

Eight logical processors with hyper-threading are still only four "real" cores. If you look closely at your timings, you will see that the execution times are almost the same until you use more than 4 threads. Beyond 4 threads, you may start running out of FPUs.
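
If you want to check whether the extra logical processors are the culprit, you can cap the number of threads QtConcurrent uses to the number of physical cores. A minimal sketch, assuming Qt's QThreadPool/QThread API (the division by two is just the hyper-threading assumption, not something queried from the hardware):

#include <QThreadPool>
#include <QThread>

int main()
{
    // idealThreadCount() reports *logical* processors (8 on the asker's laptop).
    const int logicalCores = QThread::idealThreadCount();

    // QtConcurrent::map() uses the global pool by default, so capping it here
    // limits the mapped jobs to the (assumed) physical cores only.
    QThreadPool::globalInstance()->setMaxThreadCount(logicalCores / 2);

    return 0;
}

With the pool capped this way, you can rerun the benchmark and see whether the degradation beyond 4 jobs disappears.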

However, to get a better understanding of the issue, I suggest having a look at the actual assembly code produced.

When we want to measure raw performance, we must keep in mind that our C++ code is just a higher-level representation; the actual executable may be quite different from what we would expect.

The compiler will perform its optimizations, the CPU will execute things out of order, etc...

Therefore, first of all I would recommend avoiding compile-time constant limits in your loops. Depending on the case, the compiler may unroll the loop or even replace it entirely with the result of its calculation.

As an example, the code:

int main()
{
    int z = 0;
    for(int k=0; k < 1000; k++)
        z += k;

    return z;
}

is compiled by GCC 8.1 with optimizations -O2 as:

main:
  mov eax, 499500
  ret

As you can see the loop just disappeared!

The compiler replaced it with the actual end result.

Using an example like this to measure performance is dangerous. With the example above, iterating 1000 times or 80000 times is exactly the same, because the loop is replaced with a constant in both cases (of course, if you overflow your loop variable the compiler can't replace it anymore).
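
One simple way to keep the optimizer honest is to make the loop bound a runtime value, for example by passing it on the command line. A minimal sketch of the idea (reading the limit from argv is just for illustration, not part of the asker's code):

#include <cstdio>
#include <cstdlib>

int main(int argc, char* argv[])
{
    // The bound is only known at run time, so the compiler cannot fold the loop.
    const long limit = (argc > 1) ? std::atol(argv[1]) : 1000;

    long z = 0;
    for (long k = 0; k < limit; ++k)
        z += k;

    // Printing the result keeps the loop from being removed as dead code.
    std::printf("%ld\n", z);
    return 0;
}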

MSVC is not that aggressive, but you never know exactly what the optimizer does, unless you look at the assembly code.

The problem with looking at the produced assembly code is that it can be massive...

A simple way to solve the issue is to use the great Compiler Explorer (https://godbolt.org). Just type in your C/C++ code, select the compiler you want to use, and see the result.

Now, back to your code: I tested it with Compiler Explorer using MSVC 2015 for x86_64.

Without optimizations, the assembly code looks almost the same, except for the intrinsic at the end that converts to double (cvtsi2sd).

However, things start to get interesting when we enable optimizations (which is the default when compiling in release mode).

Compiling with the flag -O2, the assembly code produced when mDummy is a long variable (32 bit) is:

Algorithm::runAlgorithm, COMDAT PROC
        xor      r8d, r8d
        mov      r9d, r8d
        npad     10
$LL4@runAlgorit:
        mov      rax, r9
        mov      edx, 100000          ; 000186a0H
        npad     8
$LL7@runAlgorit:
        dec      r8
        add      r8, rax
        add      rax, -4
        sub      rdx, 1
        jne      SHORT $LL7@runAlgorit
        add      r9, 2
        cmp      r9, 400000             ; 00061a80H
        jl       SHORT $LL4@runAlgorit
        mov      DWORD PTR [rcx], r8d
        ret      0
Algorithm::runAlgorithm ENDP

and when mDummy is a float:

Algorithm::runAlgorithm, COMDAT PROC
        mov      QWORD PTR [rsp+8], rbx
        mov      QWORD PTR [rsp+16], rdi
        xor      r10d, r10d
        xor      r8d, r8d
$LL4@runAlgorit:
        xor      edx, edx
        xor      r11d, r11d
        xor      ebx, ebx
        mov      r9, r8
        xor      edi, edi
        npad     4
$LL7@runAlgorit:
        add      r11, -3
        add      r10, r9
        mov      rax, r8
        sub      r9, 4
        sub      rax, rdx
        dec      rax
        add      rdi, rax
        mov      rax, r8
        sub      rax, rdx
        add      rax, -2
        add      rbx, rax
        mov      rax, r8
        sub      rax, rdx
        add      rdx, 4
        add      r11, rax
        cmp      rdx, 200000          ; 00030d40H
        jl       SHORT $LL7@runAlgorit
        lea      rax, QWORD PTR [r11+rbx]
        inc      r8
        add      rax, rdi
        add      r10, rax
        cmp      r8, 200000             ; 00030d40H
        jl       SHORT $LL4@runAlgorit
        mov      rbx, QWORD PTR [rsp+8]
        xorps    xmm0, xmm0
        mov      rdi, QWORD PTR [rsp+16]
        cvtsi2ss xmm0, r10
        movss    DWORD PTR [rcx], xmm0
        ret      0
Algorithm::runAlgorithm ENDP

Without getting into the details of how these two codes work or why the optimizer behaves differently in the two cases, we can clearly see some differences.

In particular, the second version (the one with mDummy being float):

  • is slightly longer
  • uses more registers
  • accesses memory more often

So, aside from the hyper-threading issue, the second version is more likely to produce cache misses, and since the cache is shared, this can also affect the final execution times.

Moreover, things like Turbo Boost may kick in as well. Your CPU may throttle down when you stress it, causing an increase in the overall execution time.
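
To separate such effects from the thread-count issue, it helps to time each batch with millisecond precision instead of comparing whole-second timestamps. A minimal sketch, assuming Qt's QElapsedTimer (the trivial lambda is just a placeholder for the real runAlgorithm() calls):

#include <QtConcurrent/QtConcurrent>
#include <QElapsedTimer>
#include <QDebug>

int main()
{
    QElapsedTimer timer;
    timer.start();

    // Run the workload and measure wall-clock time in milliseconds.
    QFuture<void> f = QtConcurrent::run([] { /* ... work ... */ });
    f.waitForFinished();

    qDebug() << "took" << timer.elapsed() << "ms";
    return 0;
}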

For the record, this is what clang produces with optimizations turned on:

Algorithm::runAlgorithm(): # @Algorithm::runAlgorithm()
  mov dword ptr [rdi], 0
  ret

Confused? Well... nobody uses mDummy elsewhere, so clang decided to remove the whole thing entirely... :)
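
If you want the benchmark to survive this kind of dead-code elimination, make the result observable, e.g. by returning it and printing it in the caller. A minimal sketch of the idea (hypothetical Work type, not the asker's class; the small bound is just for illustration):

#include <cstdio>

struct Work
{
    long long run(long long n)
    {
        long long k = 0;
        for (long long i = 0; i < n; ++i)
            for (long long j = 0; j < n; ++j)
                k = k + i - j;
        return k;   // returned instead of stored in an otherwise unused member
    }
};

int main(int argc, char**)
{
    // argc keeps the bound a runtime value; printing keeps the result alive.
    long long result = Work().run(2000LL * argc);
    std::printf("%lld\n", result);
    return 0;
}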

Upvotes: 1
