Reputation: 1290
While investigating a small performance issue, I noticed an interesting stack allocation behavior. Here is the template I use to measure time:
#include <chrono>
#include <iostream>

using namespace std;
using namespace std::chrono;

int x; // to keep the compiler from optimizing foo() away entirely

void foo();

int main()
{
    const size_t n = 10000000; // ten million
    auto start = high_resolution_clock::now();
    for (size_t i = 0; i < n; i++)
    {
        foo();
    }
    auto finish = high_resolution_clock::now();
    cout << duration_cast<milliseconds>(finish - start).count() << endl;
}
Now it all comes down to the implementation of foo(). In each implementation, a total of 500000 ints will be allocated:
Allocated in one chunk:
void foo()
{
    const int size = 500000;
    int a1[size];
    x = a1[size - 1];
}
Result: 7.3 seconds;
Allocated in two chunks:
void foo()
{
    const int size = 250000;
    int a1[size];
    int a2[size];
    x = a1[size - 1] + a2[size - 1];
}
Result: 3.5 seconds;
Allocated in four chunks:
void foo()
{
    const int size = 125000;
    int a1[size];
    int a2[size];
    int a3[size];
    int a4[size];
    x = a1[size - 1] + a2[size - 1] +
        a3[size - 1] + a4[size - 1];
}
Result: 1.8 seconds.
And so on. When I split it into 16 chunks, the time drops to 0.38 seconds.
Can you explain why and how this happens?
I used MSVC 2013 (v120), Release build.
UPD:
My machine is an x64 platform, but I compiled the program for the Win32 platform.
When I compile it for the x64 platform, every version takes about 40 ms.
Why does the platform choice have such a large effect?
Upvotes: 8
Views: 304
Reputation: 34377
You should look at the resulting assembler code to see what your compiler really does with the code. For gcc/clang/icc you can use Matt Godbolt's Compiler Explorer.
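Since the question uses MSVC, you can also get an assembly listing locally by passing /FA (this assumes a Visual Studio developer command prompt):

cl /O2 /FA main.cpp

This writes the generated assembly to main.asm, where you can see how much stack each version of foo actually reserves.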
clang optimizes everything away because of the undefined behavior (the arrays are read uninitialized), and the result is (foo is the first version, foo2 the second):
foo:                                    # @foo
        retq
foo2:                                   # @foo2
        retq
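If you want the arrays to survive (a variant of my own, not code from the question), the accesses have to be observable, for example via volatile:

extern int x; // the global sink from the question

// Forces clang to keep the allocation: volatile accesses cannot be removed.
// (The ~2 MB frame needs a large enough stack, as in the question.)
void foo_kept()
{
    const int size = 500000;
    volatile int a1[size];
    a1[size - 1] = 1;  // write first, so the read below is well-defined
    x = a1[size - 1];
}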
icc treats both versions very similarly:
foo:
        pushq   %rbp                    #4.1
        movq    %rsp, %rbp              #4.1
        subq    $2000000, %rsp          #4.1
        movl    -4(%rbp), %eax          #8.9
        movl    %eax, x(%rip)           #8.5
        leave                           #10.1
        ret                             #10.1
foo2:
        pushq   %rbp                    #13.1
        movq    %rsp, %rbp              #13.1
        subq    $2000000, %rsp          #13.1
        movl    -1000004(%rbp), %eax    #18.9
        addl    -4(%rbp), %eax          #18.24
        movl    %eax, x(%rip)           #18.5
        leave                           #19.1
        ret
gcc creates different assembler code for the different versions. Version 6.1 produces code that would show behavior similar to your experiments:
foo:
        pushq   %rbp
        movq    %rsp, %rbp
        subq    $2000016, %rsp
        movl    1999996(%rsp), %eax
        movl    %eax, x(%rip)
        leave
        ret
foo2:
        pushq   %rbp
        movl    $1000016, %edx    # only the first array is allocated
        movq    %rsp, %rbp
        subq    %rdx, %rsp
        leaq    3(%rsp), %rax
        subq    %rdx, %rsp
        shrq    $2, %rax
        movl    999996(,%rax,4), %eax
        addl    999996(%rsp), %eax
        movl    %eax, x(%rip)
        leave
        ret
Thus the only way to understand the difference is to look at the assembler code produced by your compiler; everything else is just guessing.
Upvotes: 1
Reputation: 32727
Looking at the disassembly from VS2015 Update 3: in the 2- and 4-array versions of foo, the compiler optimizes out the unused arrays, so it only reserves stack space for one array in each function. Since the later functions have smaller arrays, this takes less time. The assignment to x reads the same memory location for both/all 4 arrays. (Since the arrays are uninitialized, reading from them is undefined behavior.) Without optimization, there are 2 or 4 distinct arrays that are read from.
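As a sketch (my paraphrase of what the optimizer effectively does here, not actual compiler output), the 2-array version behaves as if it had been written like this:

void foo() // 2-array version after optimization: an illustrative sketch
{
    const int size = 250000;
    int a1[size];                    // only one array's stack space is reserved
    x = a1[size - 1] + a1[size - 1]; // both reads hit the same memory location
}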
The long time taken for these functions is due to stack probes performed by __chkstk as part of stack overflow detection (necessary when the compiler needs more than 1 page of space to hold all the local variables).
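Conceptually, a stack probe touches the new stack area one page at a time so that Windows' guard-page mechanism can commit it incrementally. The following is only an illustration of the idea; the real __chkstk is hand-written CRT assembly, and probe_stack is a hypothetical name:

#include <cstddef>

// Illustrative sketch of stack probing; NOT the actual __chkstk implementation.
void probe_stack(char* frame_top, size_t frame_size)
{
    const size_t page = 4096;                    // x86 page size
    for (size_t off = page; off <= frame_size; off += page)
    {
        *(volatile char*)(frame_top - off) = 0;  // touch each page in order
    }
}

With a ~2 MB frame that is roughly 500 page touches per call, repeated ten million times; the split versions are faster simply because, once the unused arrays are removed, the remaining frame spans fewer pages.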
Upvotes: 9