Reputation: 1290
While investigating a small performance issue, I noticed an interesting stack allocation behavior. Here is the template I use to measure time:
#include <chrono>
#include <iostream>

using namespace std;
using namespace std::chrono;

int x; // to keep the compiler from optimizing foo() away entirely

void foo();

int main()
{
    const size_t n = 10000000; // ten million
    auto start = high_resolution_clock::now();
    for (size_t i = 0; i < n; i++)
    {
        foo();
    }
    auto finish = high_resolution_clock::now();
    cout << duration_cast<milliseconds>(finish - start).count() << endl;
}
Now it all comes down to the implementation of foo(). In each implementation, a total of 500000 ints will be allocated:
Allocated in one chunk:
void foo()
{
    const int size = 500000;
    int a1[size];
    x = a1[size - 1];
}
Result: 7.3 seconds;
Allocated in two chunks:
void foo()
{
    const int size = 250000;
    int a1[size];
    int a2[size];
    x = a1[size - 1] + a2[size - 1];
}
Result: 3.5 seconds;
Allocated in four chunks:
void foo()
{
    const int size = 125000;
    int a1[size];
    int a2[size];
    int a3[size];
    int a4[size];
    x = a1[size - 1] + a2[size - 1] +
        a3[size - 1] + a4[size - 1];
}
Result: 1.8 seconds.
And so on. When I split it into 16 chunks, the time drops to 0.38 seconds.
Can you explain why and how this happens?
I used MSVC 2013 (v120), Release build.
UPD:
My machine is an x64 platform, but I compiled the program for the Win32 platform.
When I compile it for the x64 platform, every version takes about 40 ms.
Why does the platform choice have such a large effect?
Upvotes: 8
Views: 304
Reputation: 34377
You should look at the resulting assembler code to see what your compiler really does with the code. For gcc/clang/icc you can use Matt Godbolt's Compiler Explorer.
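Since the question uses MSVC, you can also get an assembly listing locally by passing /FA (this assumes a Visual Studio developer command prompt):

cl /O2 /FA main.cpp

This writes the generated assembly to main.asm, where you can see how much stack each version of foo actually reserves.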
clang optimizes everything away because of the undefined behavior (the arrays are read uninitialized), and the result is (foo is the first version, foo2 the second):
foo:                                    # @foo
        retq
foo2:                                   # @foo2
        retq
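If you want the arrays to survive (a variant of my own, not code from the question), the accesses have to be observable, for example via volatile:

extern int x; // the global sink from the question

// Forces clang to keep the allocation: volatile accesses cannot be removed.
// (The ~2 MB frame needs a large enough stack, as in the question.)
void foo_kept()
{
    const int size = 500000;
    volatile int a1[size];
    a1[size - 1] = 1;  // write first, so the read below is well-defined
    x = a1[size - 1];
}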
icc treats both versions very similarly:
foo:
        pushq   %rbp                    #4.1
        movq    %rsp, %rbp              #4.1
        subq    $2000000, %rsp          #4.1
        movl    -4(%rbp), %eax          #8.9
        movl    %eax, x(%rip)           #8.5
        leave                           #10.1
        ret                             #10.1
foo2:
        pushq   %rbp                    #13.1
        movq    %rsp, %rbp              #13.1
        subq    $2000000, %rsp          #13.1
        movl    -1000004(%rbp), %eax    #18.9
        addl    -4(%rbp), %eax          #18.24
        movl    %eax, x(%rip)           #18.5
        leave                           #19.1
        ret
gcc creates different assembler code for the different versions. Version 6.1 produces code that would show behavior similar to your experiments:
foo:
        pushq   %rbp
        movq    %rsp, %rbp
        subq    $2000016, %rsp
        movl    1999996(%rsp), %eax
        movl    %eax, x(%rip)
        leave
        ret
foo2:
        pushq   %rbp
        movl    $1000016, %edx    # only the first array is allocated
        movq    %rsp, %rbp
        subq    %rdx, %rsp
        leaq    3(%rsp), %rax
        subq    %rdx, %rsp
        shrq    $2, %rax
        movl    999996(,%rax,4), %eax
        addl    999996(%rsp), %eax
        movl    %eax, x(%rip)
        leave
        ret
Thus the only way to understand the difference is to look at the assembler code produced by your compiler; everything else is just guessing.
Upvotes: 1
Reputation: 32727
Looking at the disassembly from VS2015 Update 3: in the 2- and 4-array versions of foo, the compiler optimizes out the unused arrays, so it only reserves stack space for one array in each function. Since the later functions have smaller arrays, this takes less time. The assignment to x reads the same memory location for both/all 4 arrays. (Since the arrays are uninitialized, reading from them is undefined behavior.) Without optimization, there are 2 or 4 distinct arrays that are read from.
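As a sketch (my paraphrase of what the optimizer effectively does here, not actual compiler output), the 2-array version behaves as if it had been written like this:

void foo() // 2-array version after optimization: an illustrative sketch
{
    const int size = 250000;
    int a1[size];                    // only one array's stack space is reserved
    x = a1[size - 1] + a1[size - 1]; // both reads hit the same memory location
}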
The long time taken for these functions is due to stack probes performed by __chkstk as part of stack overflow detection (necessary when the compiler needs more than 1 page of space to hold all the local variables).
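Conceptually, a stack probe touches the new stack area one page at a time so that Windows' guard-page mechanism can commit it incrementally. The following is only an illustration of the idea; the real __chkstk is hand-written CRT assembly, and probe_stack is a hypothetical name:

#include <cstddef>

// Illustrative sketch of stack probing; NOT the actual __chkstk implementation.
void probe_stack(char* frame_top, size_t frame_size)
{
    const size_t page = 4096;                    // x86 page size
    for (size_t off = page; off <= frame_size; off += page)
    {
        *(volatile char*)(frame_top - off) = 0;  // touch each page in order
    }
}

With a ~2 MB frame that is roughly 500 page touches per call, repeated ten million times; the split versions are faster simply because, once the unused arrays are removed, the remaining frame spans fewer pages.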
Upvotes: 9