MrPisarik

Reputation: 1290

Stack allocation feature (performance)

While investigating a small performance issue, I noticed an interesting stack allocation behavior. Here is the template I use to measure the time:

#include <chrono>
#include <iostream>

using namespace std;
using namespace std::chrono;

int x; //for simple optimization suppression
void foo();

int main()
{   
    const size_t n = 10000000; //ten millions
    auto start = high_resolution_clock::now();

    for (size_t i = 0; i < n; i++)
    {
        foo();
    }

    auto finish = high_resolution_clock::now();
    cout << duration_cast<milliseconds>(finish - start).count() << endl;
}

Everything now depends on the implementation of foo(); each implementation allocates 500,000 ints in total:

  1. Allocated in one chunk:

    void foo()
    {
        const int size = 500000;
        int a1[size];
    
        x = a1[size - 1];
    }  
    

    Result: 7.3 seconds;

  2. Allocated in two chunks:

    void foo()
    {
        const int size = 250000;
        int a1[size];
        int a2[size];
    
        x = a1[size - 1] + a2[size - 1];
    }
    

    Result: 3.5 seconds;

  3. Allocated in four chunks:

    void foo()
    {
        const int size = 125000;
        int a1[size];
        int a2[size];
        int a3[size];
        int a4[size];
    
        x = a1[size - 1] + a2[size - 1] +
            a3[size - 1] + a4[size - 1];
    } 
    

    Result: 1.8 seconds.

And so on: splitting the allocation into 16 chunks brings the time down to 0.38 seconds.


Can someone explain why and how this happens?
I used MSVC 2013 (v120), Release build.

UPD:
My machine is an x64 platform, but I compiled the program for the Win32 platform.
When I compile it for the x64 platform instead, it takes about 40 ms in all cases.
Why does the platform choice have such a large effect?

Upvotes: 8

Views: 304

Answers (2)

ead

Reputation: 34377

You should look at the resulting assembler code, to see what your compiler really does with the code. For gcc/clang/icc you can use Matt Godbolt's Compiler Explorer.

clang optimizes everything away because of the undefined behavior, and the result is (foo is the first version, foo2 the second):

foo:                                    # @foo
        retq

foo2:                                   # @foo2
        retq

icc treats both versions very similarly:

foo:
        pushq     %rbp                                          #4.1
        movq      %rsp, %rbp                                    #4.1
        subq      $2000000, %rsp                                #4.1
        movl      -4(%rbp), %eax                                #8.9
        movl      %eax, x(%rip)                                 #8.5
        leave                                                   #10.1
        ret                                                     #10.1

foo2:
        pushq     %rbp                                          #13.1
        movq      %rsp, %rbp                                    #13.1
        subq      $2000000, %rsp                                #13.1
        movl      -1000004(%rbp), %eax                          #18.9
        addl      -4(%rbp), %eax                                #18.24
        movl      %eax, x(%rip)                                 #18.5
        leave                                                   #19.1
        ret 

gcc creates different assembler code for the different versions. Version 6.1 produces code that would show behavior similar to your experiments:

foo:
        pushq   %rbp
        movq    %rsp, %rbp
        subq    $2000016, %rsp
        movl    1999996(%rsp), %eax
        movl    %eax, x(%rip)
        leave
        ret
foo2:
        pushq   %rbp
        movl    $1000016, %edx  #only the first array is allocated
        movq    %rsp, %rbp
        subq    %rdx, %rsp
        leaq    3(%rsp), %rax
        subq    %rdx, %rsp
        shrq    $2, %rax
        movl    999996(,%rax,4), %eax
        addl    999996(%rsp), %eax
        movl    %eax, x(%rip)
        leave
        ret

Thus the only way to understand the difference is to look at the assembler code produced by your compiler; everything else is just guessing.

Upvotes: 1

1201ProgramAlarm

Reputation: 32727

Looking at the disassembly from VS2015 Update 3: in the 2- and 4-array versions of foo, the compiler optimizes out the unused arrays, so it only reserves stack space for one array in each function. Since the later functions have smaller arrays, this takes less time. The assignment to x reads the same memory location for all of the arrays. (Since the arrays are uninitialized, reading from them is undefined behavior.) With optimization disabled, there are 2 or 4 distinct arrays that are read from.

The long time taken for these functions is due to stack probes performed by __chkstk as part of stack overflow detection (necessary when the compiler needs more than 1 page of space to hold all the local variables).

Upvotes: 9
