LyingOnTheSky

Reputation: 2854

Why is iterating over a large number of bytes significantly slower the second time? And how can I fix it?

This code:

#include <memory>
#include <time.h>
#include <chrono>
#include <thread>
#include <stdio.h>
#include <stdlib.h>

void Test( ) {
#define current_milliseconds std::chrono::duration_cast<std::chrono::milliseconds>( std::chrono::system_clock::now( ).time_since_epoch( ) ).count( )
    int *c = ( int* )malloc( 1024 * 1024 * 1024 );
    int result = 0;
    auto millis = -current_milliseconds;
    //clock_t timer = -clock( );
    for ( int i = 0 ; i < 1024 * 1024 * 256 /* 1024 / 4 */; ++i )
        result += c[ i ];
    millis += current_milliseconds;
    printf( "Took: %ldms (JUST PASSING BY: %d)\n", millis, result );
    free( c );
#undef current_milliseconds
}

int main( ) {
    std::this_thread::sleep_for( std::chrono::milliseconds( 1 ) );
    Test( );
    std::this_thread::sleep_for( std::chrono::milliseconds( 1 ) );
    Test( );
    return -1;
}

I ran the test 7 times; here are the last 6 outputs:

Took: 502ms (JUST PASSING BY: 0)
Took: 607ms (JUST PASSING BY: 0)

Took: 480ms (JUST PASSING BY: 0)
Took: 588ms (JUST PASSING BY: 0)

Took: 492ms (JUST PASSING BY: 0)
Took: 562ms (JUST PASSING BY: 0)

Took: 506ms (JUST PASSING BY: 0)
Took: 558ms (JUST PASSING BY: 0)

Took: 470ms (JUST PASSING BY: 0)
Took: 555ms (JUST PASSING BY: 0)

Took: 510ms (JUST PASSING BY: 0)
Took: 562ms (JUST PASSING BY: 0)

If your output is different, try running the executable again (hard-disk cache misses) or enlarging the number of iterations and allocated bytes (just a hunch).

Note that the timed range covers only the loop, not the allocation. So, the question again: why is the second iteration slower? Is there a way to fix it?

Additional Information:

  1. The PC has a 2.8 GHz dual-core Pentium processor (Intel E6300), 4 GB RAM (about 2.2 GB free before executing the tests), and an enterprise Intel SSD.
  2. It seems that while the tests were executing, the process wrote a few hundred MB to disk. Why did it, when there was enough free RAM? (I deallocate 1 GB and then allocate another 1 GB; it shouldn't need to go through the swap file.)

Upvotes: 13

Views: 729

Answers (6)

prisco.napoli

Reputation: 616

As the others have already explained, you are basically experiencing a kind of bias (initialisation cost + OS + other hardware-dependent operations that you can't control).

All results are workstation dependent (CPU + RAM + OS). For instance, my results are always close for the 1st and 2nd run (tested with and without optimisation; see the outputs below).

Just out of curiosity: instead of calling Test() only 2 times, try calling it 10 or 15 times in your code. You should start to see much closer results.
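A minimal sketch of that change, reusing the question's main() (the 1 ms sleeps are kept from the original):

int main( ) {
    for ( int i = 0 ; i < 15; ++i ) {
        std::this_thread::sleep_for( std::chrono::milliseconds( 1 ) );
        Test( );
    }
    return 0;
}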

Lastly, I don't know if you are trying to evaluate the performance of your code by looking at execution time. If so, please consider using better tools such as valgrind or gprof. With valgrind, you can use callgrind to get an estimate of the number of executed instructions and of hardware behaviour (it is possible to pass options to simulate cache misses, branch prediction, the hardware prefetcher, etc.):

 valgrind --tool=callgrind --cache-sim=yes --branch-sim=yes YOUR_PROGRAM

Executed on an Apple MacBook Pro (late 2014):

Took: 914ms (JUST PASSING BY: 0)
Took: 911ms (JUST PASSING BY: 0)

Took: 917ms (JUST PASSING BY: 0)
Took: 913ms (JUST PASSING BY: 0)
Took: 910ms (JUST PASSING BY: 0)
Took: 916ms (JUST PASSING BY: 0)
Took: 914ms (JUST PASSING BY: 0)
Took: 907ms (JUST PASSING BY: 0)

Took: 387ms (JUST PASSING BY: 0)
Took: 387ms (JUST PASSING BY: 0)

Took: 380ms (JUST PASSING BY: 0)
Took: 387ms (JUST PASSING BY: 0)

Took: 384ms (JUST PASSING BY: 0)
Took: 382ms (JUST PASSING BY: 0)

Took: 380ms (JUST PASSING BY: 0)
Took: 381ms (JUST PASSING BY: 0)

Upvotes: 0

Jose Palma

Reputation: 756

Took: 682ms (JUST PASSING BY: 0)
Took: 666ms (JUST PASSING BY: 0)

Took: 686ms (JUST PASSING BY: 0)
Took: 680ms (JUST PASSING BY: 0)

Took: 674ms (JUST PASSING BY: 0)
Took: 675ms (JUST PASSING BY: 0)

Took: 694ms (JUST PASSING BY: 0)
Took: 684ms (JUST PASSING BY: 0)

On my workstation the second test takes the same time or even less. If you know the memory layout of the problem, you can always use compiler builtins to prefetch memory:

 __builtin_prefetch

https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html
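As a rough illustration (not a tested fix), the question's loop could prefetch one cache line ahead. The distance of 16 ints (one 64-byte line) is an assumption to tune per machine; prefetching slightly past the end of the buffer is harmless because the builtin is only a hint:

for ( int i = 0 ; i < 1024 * 1024 * 256; ++i ) {
    // hint: read access (0), low temporal locality (0) - the data is used once
    __builtin_prefetch( &c[ i + 16 ], 0, 0 );
    result += c[ i ];
}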

Upvotes: 1

Hans Passant

Reputation: 941605

Output on my machine:

Took: 371ms (JUST PASSING BY: 0)
Took: 318ms (JUST PASSING BY: 0)

Somewhat more typical for what I expect most programmers to see when they try your program. You can make a small change to get a vastly different outcome:

int *c = (int*)malloc(1024 * 1024 * 1024);
memset(c, 0, 1024 * 1024 * 1024);          // <== added
// etc..

Which produces on my machine:

Took: 104ms (JUST PASSING BY: 0)
Took: 102ms (JUST PASSING BY: 0)

A fat 3x speed-up, just from initializing the memory content. Hope it reproduces on your machine, it should. The first conclusion to draw is that you've been benchmarking something completely different from the cost of your code. A very typical benchmark hazard on modern machines.

This is the behavior of a demand-paged virtual memory operating system. Like Windows, Linux or OSX. Your malloc() call never actually allocated any memory, it merely reserved address space. Just numbers to the processor, there is one for each 4096 bytes of memory. You don't pay the cost of using memory until later, when you actually address it. When the demand-paged feature comes into play.

Which happens in your result += c[ i ]; statement. At that point the pedal must meet the metal and the operating system is forced to actually make the memory available. Every 4096 bytes your program generates a page fault. The operating system steps in and maps 4096 bytes of RAM to the virtual memory page. Your program generates 1GB / 4096 = 262,144 of those page faults. You can conclude that your operating system requires roughly 400ms/262144 ~= 1.5 microseconds to handle a page fault. Mine is about twice as fast at it.

Note how the memset() call hid that cost: it generated all those page faults before you started to time the code execution. Thus it truly measures the cost of the code, avoiding the otherwise inevitable initialization overhead.
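If you want the same effect without paying for a full memset, touching one byte per page is enough to trigger every page fault up front. A minimal sketch, assuming 4096-byte pages (in portable code, query the real page size via sysconf() or GetSystemInfo()):

char *p = ( char* )c;
for ( size_t off = 0; off < 1024u * 1024 * 1024; off += 4096 )
    p[ off ] = 0;    // the first write to each page faults it in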

How long your first run takes is going to depend on how quickly the operating system can make the RAM available. Actual measurement can vary greatly from one try to the next, it depends on how many other processes have mapped RAM pages. Might take quite a while if the OS needs to find space and unmap pages first, preserving their content in the paging file.

How long the second run takes is going to depend on how quickly the operating system can recycle the RAM pages if there are not enough of them available to map another gigabyte. Not much of a problem on mine, I have 8 GB of RAM and am using only 5.6 GB of it right now. They need to be zero-initialized, a low priority duty on a typical OS.

So, basic conclusion here:

  • You are measuring an initialization cost, it is not at all representative of how your program is going to be affected while it continues to use the memory. Your benchmark, as-is, doesn't tell you anything.
  • The difference you observed is an operating system implementation detail, it doesn't have anything to do with your program.

Upvotes: 20

Walter A

Reputation: 20002

(rewrite after comment)

The second iteration is significantly slower because the load on the system is different. The first Test runs in a normal situation; the second run starts right after the first one has finished.
Other programs on the system are still recovering/swapping/logging after the fight for resources when the second Test is started. When you want to compare two Tests, give the system some time to return to a normal state before launching the second one.
A simple method is adding a sleep of a few seconds after the first Test().

This solution was already discussed in the comments; the sketch below shows the change. LyingOnTheSky confirmed that adding a sleep of 3 seconds solved the mystery: the first and second tests then took exactly the same time (421 ms).
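Applied to the question's main(), the change amounts to:

int main( ) {
    Test( );
    // let the system settle before the second run
    std::this_thread::sleep_for( std::chrono::seconds( 3 ) );
    Test( );
    return 0;
}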

Upvotes: 2

Maxim Egorushkin

Reputation: 136286

Your time measurements look inconsistent because there is too much variance in the timings. It looks like the CPU may be switching frequencies while you run your benchmarks.

Try disabling CPU frequency scaling when running the benchmarks. On Windows you can do that by setting the High performance profile in Power Options in the Control Panel, which locks the CPU frequency at its highest.
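The same plan can also be activated from a command prompt (SCHEME_MIN is, if I recall the alias correctly, the built-in powercfg name for the High performance plan):

 powercfg /setactive SCHEME_MIN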


Let's test this hypothesis.

Here are my results of your benchmark on Linux with the default powersave governor, they are similar to yours in respect to high timing variance:

[max@localhost:~/src/test] $ cpupower frequency-info
analyzing CPU 0:
  driver: intel_pstate
  CPUs which run at the same hardware frequency: 0
  CPUs which need to have their frequency coordinated by software: 0
  maximum transition latency: 0.97 ms.
  hardware limits: 1.20 GHz - 5.70 GHz
  available cpufreq governors: performance, powersave
  current policy: frequency should be within 1.20 GHz and 5.70 GHz.
                  The governor "powersave" may decide which speed to use
                  within this range.
  current CPU frequency is 1.26 GHz.
  boost state support:
    Supported: yes
    Active: yes
    25500 MHz max turbo 4 active cores
    25500 MHz max turbo 3 active cores
    25500 MHz max turbo 2 active cores
    25500 MHz max turbo 1 active cores

[max@localhost:~/src/test] $ for ((i=0; i<10; ++i)); do ./test; echo; done
Took: 698ms (JUST PASSING BY: 0)
Took: 598ms (JUST PASSING BY: 0)

Took: 541ms (JUST PASSING BY: 0)
Took: 570ms (JUST PASSING BY: 0)

Took: 660ms (JUST PASSING BY: 0)
Took: 656ms (JUST PASSING BY: 0)

Took: 673ms (JUST PASSING BY: 0)
Took: 616ms (JUST PASSING BY: 0)

Took: 637ms (JUST PASSING BY: 0)
Took: 650ms (JUST PASSING BY: 0)

Took: 690ms (JUST PASSING BY: 0)
Took: 667ms (JUST PASSING BY: 0)

Took: 671ms (JUST PASSING BY: 0)
Took: 603ms (JUST PASSING BY: 0)

Took: 537ms (JUST PASSING BY: 0)
Took: 544ms (JUST PASSING BY: 0)

Took: 535ms (JUST PASSING BY: 0)
Took: 629ms (JUST PASSING BY: 0)

Took: 660ms (JUST PASSING BY: 0)
Took: 656ms (JUST PASSING BY: 0)

And here are the results with the performance governor; notice how little variance there is in the timings this time:

[max@localhost:~/src/test] $ sudo cpupower frequency-set --related --governor performance
Setting cpu: 0
Setting cpu: 1
Setting cpu: 2
Setting cpu: 3
Setting cpu: 4
Setting cpu: 5
Setting cpu: 6
Setting cpu: 7

[max@localhost:~/src/test] $ cpupower frequency-info
analyzing CPU 0:
  driver: intel_pstate
  CPUs which run at the same hardware frequency: 0
  CPUs which need to have their frequency coordinated by software: 0
  maximum transition latency: 0.97 ms.
  hardware limits: 1.20 GHz - 5.70 GHz
  available cpufreq governors: performance, powersave
  current policy: frequency should be within 1.20 GHz and 5.70 GHz.
                  The governor "performance" may decide which speed to use
                  within this range.
  current CPU frequency is 4.34 GHz.
  boost state support:
    Supported: yes
    Active: yes
    25500 MHz max turbo 4 active cores
    25500 MHz max turbo 3 active cores
    25500 MHz max turbo 2 active cores
    25500 MHz max turbo 1 active cores

[max@localhost:~/src/test] $ for ((i=0; i<10; ++i)); do ./test; echo; done
Took: 539ms (JUST PASSING BY: 0)
Took: 548ms (JUST PASSING BY: 0)

Took: 543ms (JUST PASSING BY: 0)
Took: 547ms (JUST PASSING BY: 0)

Took: 542ms (JUST PASSING BY: 0)
Took: 543ms (JUST PASSING BY: 0)

Took: 548ms (JUST PASSING BY: 0)
Took: 539ms (JUST PASSING BY: 0)

Took: 538ms (JUST PASSING BY: 0)
Took: 536ms (JUST PASSING BY: 0)

Took: 536ms (JUST PASSING BY: 0)
Took: 536ms (JUST PASSING BY: 0)

Took: 546ms (JUST PASSING BY: 0)
Took: 547ms (JUST PASSING BY: 0)

Took: 559ms (JUST PASSING BY: 0)
Took: 534ms (JUST PASSING BY: 0)

Took: 537ms (JUST PASSING BY: 0)
Took: 540ms (JUST PASSING BY: 0)

Took: 538ms (JUST PASSING BY: 0)
Took: 534ms (JUST PASSING BY: 0)

This shows that both loops take the same time to execute; neither loop is faster.

Upvotes: 2

jthill

Reputation: 60295

You're accessing pages for the first time on that first lap, and that's going quicker than the second, so whatever memory management your OS does on a freshly-allocated gigabyte is cheaper the first time.

If your OS maps newly-allocated storage to a page full of zeros, maps newly-referenced memory to its own page on access rather than on write, and doesn't map a gigabyte allocation with the big 2M pages, then on the second lap through your gigabyte you'll be TLB-missing pretty much every 4K. That can get ... not cheap.

So I'm thinking that whatever your OS does to map a newly-referenced page might be faster than a full TLB miss. That could easily happen if, for instance, PTEs are cacheable and the OS clears every new bottom-level page table it allocates, allowing your TLB misses on the first lap to reload from pre-primed cache lines that were recently initialized on-chip. By the second lap, your serial misses would have long since flushed those entries, and the CPU has to fetch them over that long, long bus (which at this level of concern might better be viewed as a network connection).

I don't know the x86 performance counters or the tools to check them offhand, but if you've got access to decent tooling it should be easy to check these theories by looking at the right counter deltas.
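On Linux, for instance, perf can show the relevant deltas; a sketch (exact event names vary by CPU and kernel):

 perf stat -e page-faults,dTLB-load-misses,dTLB-loads ./test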

Upvotes: 2
