gani

Reputation: 37

Why is copying a local array faster than copying an array passed as an argument in C++?

While optimizing some code I discovered something I didn't expect. I wrote the simple program below to illustrate what I found:

#include <string.h>
#include <chrono>
#include <iostream>

using namespace std;

int globalArr[1024][1024];

void initArr(int arr[1024][1024])
{
    memset(arr, 0, 1024 * 1024 * sizeof(int));
}


void run()
{
    int arr[1024][1024];
    initArr(arr);
    for(int i = 0; i < 1024; ++i)
    {
        for(int j = 0; j < 1024; ++j)
        {
            globalArr[i][j] = arr[i][j];
        }

    }
}

void run2(int arr[1024][1024])
{
    initArr(arr);
    for(int i = 0; i < 1024; ++i)
    {
        for(int j = 0; j < 1024; ++j)
        {
            globalArr[i][j] = arr[i][j];
        }

    }
}

int main()
{
    {
        auto start = chrono::high_resolution_clock::now();
        for(int i = 0; i < 256; ++i)
        {
            run();
        }
        auto duration = chrono::high_resolution_clock::now() - start;
        cout << "(run) Total time: " << chrono::duration_cast<chrono::microseconds>(duration).count() << " microseconds\n";
    }

    {
        auto start = chrono::high_resolution_clock::now();
        for(int i = 0; i < 256; ++i)
        {
            int arr[1024][1024];
            run2(arr);
        }
        auto duration = chrono::high_resolution_clock::now() - start;
        cout << "(run2) Total time: " << chrono::duration_cast<chrono::microseconds>(duration).count() << " microseconds\n";        
    }

    return 0;
}

I built the code with g++ version 6.4.0 20180424 using the -O3 flag. Below are the results from running it on a Ryzen 1700.

(run) Total time: 43493 microseconds
(run2) Total time: 134740 microseconds

I tried looking at the assembly on godbolt.org (the code is split across two URLs):

https://godbolt.org/g/aKSHH6

https://godbolt.org/g/zfK14x

But I still don't understand what actually made the difference.

So my questions are:

1. What's causing the performance difference?
2. Is it possible to pass an array as an argument and get the same performance as with a local array?

Edit: Some extra info: below are the results of a build with -O2.

(run) Total time: 94461 microseconds
(run2) Total time: 172352 microseconds

Edit again: Following xaxxon's comment, I tried removing the initArr call from both functions. With that change, run2 is actually faster than run:

(run) Total time: 45151 microseconds
(run2) Total time: 35845 microseconds

But I still don't understand the reason.

Upvotes: 1

Views: 131

Answers (1)

Useless

Reputation: 67743

  1. What's causing the performance difference?

The compiler has to generate code for run2 that will continue to work correctly if you call

run2(globalArr);

or (worse), pass in some overlapping but non-identical address.

If you allow your C++ compiler to inline the call, and it chooses to do so, it'll be able to generate inlined code that knows whether the parameter really aliases your global. The out-of-line codegen still has to be conservative though.

  2. Is it possible to pass an array as an argument and get the same performance as with a local array?

You can certainly fix the aliasing problem in C, using the restrict keyword, like

int globalArr[1024][1024];

/* C99: restrict promises that arr and g never refer to overlapping memory. */
void run2(int (* restrict arr)[1024])
{
    int (* restrict g)[1024] = globalArr;
    for(int i = 0; i < 1024; ++i)
    {
        for(int j = 0; j < 1024; ++j)
        {
            g[i][j] = arr[i][j];
        }
    }
}

(or in C++ using the non-standard __restrict extension, which GCC, Clang and MSVC all accept).

This should allow the optimizer as much freedom as it had in your original run - unless it's smart enough to elide the local entirely and simply set the global to zero.

Upvotes: 4
