Bartvbl

Reputation: 2928

What causes my loop to run slower during its first iterations?

I've written the following program to play around with std::chrono for a bit:

#include <iostream>
#include <chrono>

int main(int argc, char** argv) {
    const long iterationCount = 10000;
    long count = 0;
    for(long i = 0; i < iterationCount; i++) {
        auto start = std::chrono::high_resolution_clock::now();
        count++;count++;count++;count++;count++;count++;count++;count++;count++;count++;
        count++;count++;count++;count++;count++;count++;count++;count++;count++;count++;
        count++;count++;count++;count++;count++;count++;count++;count++;count++;count++;
        count++;count++;count++;count++;count++;count++;count++;count++;count++;count++;
        count++;count++;count++;count++;count++;count++;count++;count++;count++;count++;
        count++;count++;count++;count++;count++;count++;count++;count++;count++;count++;
        count++;count++;count++;count++;count++;count++;count++;count++;count++;count++;
        count++;count++;count++;count++;count++;count++;count++;count++;count++;count++;
        count++;count++;count++;count++;count++;count++;count++;count++;count++;count++;
        count++;count++;count++;count++;count++;count++;count++;count++;count++;count++;
        count++;count++;count++;count++;count++;count++;count++;count++;count++;count++;
        count++;count++;count++;count++;count++;count++;count++;count++;count++;count++;
        count++;count++;count++;count++;count++;count++;count++;count++;count++;count++;
        count++;count++;count++;count++;count++;count++;count++;count++;count++;count++;
        count++;count++;count++;count++;count++;count++;count++;count++;count++;count++;
        count++;count++;count++;count++;count++;count++;count++;count++;count++;count++;
        count++;count++;count++;count++;count++;count++;count++;count++;count++;count++;
        count++;count++;count++;count++;count++;count++;count++;count++;count++;count++;
        count++;count++;count++;count++;count++;count++;count++;count++;count++;count++;
        count++;count++;count++;count++;count++;count++;count++;count++;count++;count++;
        count++;count++;count++;count++;count++;count++;count++;count++;count++;count++;
        count++;count++;count++;count++;count++;count++;count++;count++;count++;count++;
        count++;count++;count++;count++;count++;count++;count++;count++;count++;count++;
        count++;count++;count++;count++;count++;count++;count++;count++;count++;count++;
        count++;count++;count++;count++;count++;count++;count++;count++;count++;count++;
        auto end = std::chrono::high_resolution_clock::now();

        auto timeTaken = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
        std::cout << timeTaken << " " << std::endl;
    }
}

I've compiled this using G++ without compiler optimisations enabled:

g++ chrono.cpp -o chrono

I subsequently ran this program a few times and noticed an interesting pattern: for the first 500-1000 iterations, the program runs about 7-8 times slower than it does for the remaining iterations.

Here's a sample output of this program: https://pastebin.com/tUQsQEAQ

What causes this discrepancy in runtimes? My first thought was the cache, but wouldn't the cache become saturated very quickly?

In case it matters, my operating system is Ubuntu 18.04, and my g++ version is 7.3.0.

Upvotes: 7

Views: 1247

Answers (5)

Margaret Bloom

Reputation: 44136

After an implementation-defined (micro-architectural) amount of time, and provided the CPU can find some thermal and power headroom, frequency scaling kicks in and speeds up the clock of the demanding core up to a maximum (within the limits of the TDP).

Intel's implementation is called Turbo Boost.

If you disable frequency scaling on your system (e.g. with sudo cpupower frequency-set --governor performance; cpupower is in the cpupowerutils package), the time of each iteration becomes more or less the same.


The loop itself is very easy to predict: you can expect only a few mispredictions from the loop control, perhaps just one at the end. The code inside the C++ library is probably harder to predict, but even so it won't take the BPU anywhere near 1000 iterations to catch up with your code.
So you can rule out branch misprediction.

More or less the same applies to the I-cache (there is little use of the D-cache, unless the C++ library implementation is heavy on variable usage).
The code should be small enough to fit in the L1I, and most of it even in the DSB.
L1I misses don't take 1000 iterations to resolve, and if there were a heavy set conflict in the cache it would show up as a general slowdown that doesn't go away after 1000 iterations.

Generally speaking it is a known effect that code runs faster from the second iteration of a loop-carried dependency chain, because on the first pass the CPU fills the caches (data, instructions, TLB).
It can eventually slow down again if the CPU runs out of resources: for example, heavy port pressure (e.g. lots of identical limited-port long-latency instructions) may fill the RS and stall the FE, lots of loads/stores may fill the MOB/SB/LB, and lots of jumps may fill the BOB.
However, those effects kick in very quickly, to the point where they dominate the run time of the code.
In this case the slowdown happens very, very late (in CPU time), which points towards an on-demand mechanism like Turbo Boost.
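
One way to check this without touching the governor is to warm the core up before the timed loop and see whether the slow initial iterations disappear. A minimal sketch (not part of the original program; the one-second spin is an arbitrary choice):

#include <chrono>

int main() {
    // Busy-spin for roughly one second so the core has time to ramp up to its
    // boosted frequency before any measurements start. The one-second figure
    // is an arbitrary assumption.
    volatile long sink = 0;
    auto warmUpStart = std::chrono::high_resolution_clock::now();
    while (std::chrono::high_resolution_clock::now() - warmUpStart < std::chrono::seconds(1)) {
        sink = sink + 1;
    }

    // ... the timing loop from the question would follow here ...
}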

Upvotes: 4

geza

Reputation: 30010

As others have said, it is almost certain that you are seeing the effect of dynamic CPU frequency scaling.

I can reproduce your results on my machine. But if I turn off dynamic CPU frequency scaling with the cpufreq-set utility (to make the CPU run at a constant frequency), the effect you see goes away.

Upvotes: 1

BeeOnRope

Reputation: 65046

On most machines, when you start your test, the CPU will generally be in a lower power state, running at a lower frequency, but as the test continues to run the CPU will ramp up to its maximum frequency.

So a common pattern is to see slow times for the first few milliseconds, after which the runtime decreases (as the CPU speeds up) and ultimately stabilizes.
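
A rough way to see this in the numbers is to print, next to each sample, how much wall-clock time has elapsed since the program started. This is a sketch along the lines of the question's program, with the 250 unrolled increments replaced by an inner loop purely for brevity; if the slow samples all fall within the first few milliseconds rather than a fixed number of iterations, a frequency ramp is the likely explanation:

#include <chrono>
#include <iostream>

int main() {
    using clk = std::chrono::high_resolution_clock;
    const auto programStart = clk::now();
    volatile long count = 0;

    for (long i = 0; i < 10000; i++) {
        auto start = clk::now();
        for (int j = 0; j < 250; j++) {   // stand-in for the 250 unrolled increments
            count = count + 1;
        }
        auto end = clk::now();

        auto sinceStart = std::chrono::duration_cast<std::chrono::milliseconds>(start - programStart).count();
        auto timeTaken  = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();

        // If the large values all appear during the first few milliseconds of
        // sinceStart, the slowdown tracks wall-clock time, not iteration count.
        std::cout << sinceStart << " ms: " << timeTaken << " ns" << std::endl;
    }
}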

Upvotes: 0

mksteve

Reputation: 13085

I don't have a physical Linux machine to test this on, but running on a Windows 10 x64 (i7) box I got results such as ...

395
16592
395
395
395
790
395
790
395
395
395
790

This matches the end of your trace. On Windows, the value 395 appears to be phase-locked with the clock counter, so the measured time is either 395, 790, or a really large number (e.g. 16592). A really large number looks like a context switch, during which our program is not running.

Your program on my Windows machine did not have the initial slowdown.

However, the results in general look very similar to the results in the pastebin file.

So the question is why the loop takes longer right at the beginning. We can see it is not a context switch, as context switches appear to be significantly longer (around 18k). So it must be that the processor is slower at doing its job at the beginning of the program. Possible causes for that slowness are other cores on the CPU evicting our core's cache lines, or competing for the CPU's shared cache.

I would validate this by adding an initial sleep so the system can settle down at launch time, and running the loop only after the machine has settled.
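
A minimal sketch of that (the two-second figure is an arbitrary choice):

#include <chrono>
#include <thread>

int main() {
    // Give the rest of the system time to settle after launch before any
    // measurements are taken; two seconds is an arbitrary choice.
    std::this_thread::sleep_for(std::chrono::seconds(2));

    // ... the timing loop from the question would follow here ...
}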

Upvotes: 0

NoSenseEtAl

Reputation: 30138

I would guess that clock resolution could be the reason. You are measuring in a range that may be below your system's resolution. Consider this quote from the Linux documentation (emphasis mine):

Since Linux 2.6.21, Linux supports high-resolution timers (HRTs), optionally configurable via CONFIG_HIGH_RES_TIMERS. On a system that supports HRTs, the accuracy of sleep and timer system calls is no longer constrained by the jiffy, but instead can be as accurate as the hardware allows (microsecond accuracy is typical of modern hardware).
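
A quick sketch (my own, not from the documentation) for estimating the smallest tick that high_resolution_clock can actually observe on a given system:

#include <chrono>
#include <iostream>

int main() {
    using clk = std::chrono::high_resolution_clock;

    // Call now() back to back many times and keep the smallest non-zero
    // difference; this gives a rough lower bound on the usable resolution.
    auto smallest = clk::duration::max();
    for (int i = 0; i < 1000000; i++) {
        auto a = clk::now();
        auto b = clk::now();
        if (b > a && (b - a) < smallest) {
            smallest = b - a;
        }
    }

    std::cout << std::chrono::duration_cast<std::chrono::nanoseconds>(smallest).count()
              << " ns" << std::endl;
}

If the measurements in the question are much larger than the value this prints, clock resolution is probably not the limiting factor.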

Upvotes: 0
