Reputation: 37
While optimizing some code I discovered something that I didn't expect. I wrote a simple program to illustrate what I found:
#include <string.h>
#include <chrono>
#include <iostream>
using namespace std;

int globalArr[1024][1024];

void initArr(int arr[1024][1024])
{
    memset(arr, 0, 1024 * 1024 * sizeof(int));
}

void run()
{
    int arr[1024][1024];
    initArr(arr);
    for(int i = 0; i < 1024; ++i)
    {
        for(int j = 0; j < 1024; ++j)
        {
            globalArr[i][j] = arr[i][j];
        }
    }
}

void run2(int arr[1024][1024])
{
    initArr(arr);
    for(int i = 0; i < 1024; ++i)
    {
        for(int j = 0; j < 1024; ++j)
        {
            globalArr[i][j] = arr[i][j];
        }
    }
}

int main()
{
    {
        auto start = chrono::high_resolution_clock::now();
        for(int i = 0; i < 256; ++i)
        {
            run();
        }
        auto duration = chrono::high_resolution_clock::now() - start;
        cout << "(run) Total time: " << chrono::duration_cast<chrono::microseconds>(duration).count() << " microseconds\n";
    }
    {
        auto start = chrono::high_resolution_clock::now();
        for(int i = 0; i < 256; ++i)
        {
            int arr[1024][1024];
            run2(arr);
        }
        auto duration = chrono::high_resolution_clock::now() - start;
        cout << "(run2) Total time: " << chrono::duration_cast<chrono::microseconds>(duration).count() << " microseconds\n";
    }
    return 0;
}
I built the code with g++ version 6.4.0 20180424 using the -O3 flag. Below is the result of running it on a Ryzen 1700:
(run) Total time: 43493 microseconds
(run2) Total time: 134740 microseconds
I looked at the assembly on godbolt.org (the code is split across 2 URLs), but I still don't understand what actually made the difference.
So my questions are:
1. What's causing the performance difference?
2. Is it possible to pass an array as an argument with the same performance as a local array?
Edit: As extra information, below is the result when built with -O2:
(run) Total time: 94461 microseconds
(run2) Total time: 172352 microseconds
Edit again: Following xaxxon's comment, I tried removing the initArr call from both functions. With that change, run2 is actually faster than run:
(run) Total time: 45151 microseconds
(run2) Total time: 35845 microseconds
But I still don't understand the reason.
Upvotes: 1
Views: 131
Reputation: 67743
- What's causing the performance difference?
The compiler has to generate code for run2 that will continue to work correctly if you call
run2(globalArr);
or (worse), pass in some overlapping but non-identical address.
If you allow your C++ compiler to inline the call, and it chooses to do so, it'll be able to generate inlined code that knows whether the parameter really aliases your global. The out-of-line codegen still has to be conservative though.
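To make the overlapping case concrete, here is a minimal sketch (the names copyForward and overlappingCopy are hypothetical, not from the question). It uses the same element-by-element loop shape as run2, but with a destination that starts one element after the source. The language requires the result shown in the comment, so without further information the compiler cannot simply reorder or batch the loads and stores as if the two ranges were independent.

```cpp
#include <cstddef>
#include <vector>

// Forward element-by-element copy, the same shape as the inner loop in run2.
// Nothing tells the compiler that dst and src are distinct, so the generated
// code must give the sequential result even when the ranges overlap.
void copyForward(int* dst, const int* src, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
    {
        dst[i] = src[i];
    }
}

// Hypothetical helper: the destination begins one element after the source.
std::vector<int> overlappingCopy()
{
    std::vector<int> buf{1, 2, 3, 4, 5};
    // Each store feeds the next load, so the first element is propagated
    // through the whole buffer: the result is {1, 1, 1, 1, 1}.
    copyForward(buf.data() + 1, buf.data(), 4);
    return buf;
}
```

If the loop were executed as one wide load followed by one wide store, the buffer would end up as {1, 1, 2, 3, 4} instead, which is why the out-of-line code has to stay conservative, typically by checking for overlap at run time or by using a slower path.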
- Is it possible to pass an array as an argument with the same performance as a local array?
You can certainly fix the aliasing problem in C, using the restrict keyword, like
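/* Assumed declaration: the global destination that this snippet copies into. */
int globalArr1[32][256];

void run2(int (* restrict globalArr2)[256])
{
    int (* restrict g)[256] = globalArr1;
    for(int i = 0; i < 32; ++i)
    {
        for(int j = 0; j < 256; ++j)
        {
            g[i][j] = globalArr2[i][j];
        }
    }
}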
(or probably in C++ using the non-standard extension __restrict).
This should allow the optimizer as much freedom as it had in your original run, unless it's smart enough to elide the local entirely and simply set the global to zero.
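As a rough sketch of how that might look in C++ for the 1024x1024 arrays from the question, assuming a compiler that accepts the non-standard __restrict qualifier (g++, Clang and MSVC do); the constant N is introduced here only for brevity. Whether it actually closes the gap should be confirmed by measuring and by inspecting the generated assembly.

```cpp
#include <cstring>

constexpr int N = 1024;   // assumed name for the dimension used in the question
int globalArr[N][N];

// __restrict is a compiler extension, not standard C++. It is a promise from
// the caller that arr does not overlap anything else accessed in this
// function; calling run2(globalArr) would break that promise and is undefined.
void run2(int (*__restrict arr)[N])
{
    std::memset(arr, 0, sizeof(int) * N * N);

    // Access the global through a restrict-qualified pointer as well,
    // mirroring the C version above.
    int (*__restrict g)[N] = globalArr;
    for (int i = 0; i < N; ++i)
    {
        for (int j = 0; j < N; ++j)
        {
            g[i][j] = arr[i][j];
        }
    }
}
```

The caller still passes an ordinary array, for example a static or heap-allocated int[1024][1024]; the only new obligation is never to pass a buffer that overlaps globalArr.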
Upvotes: 4