user938720

Reputation: 271

Why is memory access so slow?

I've got a very strange performance issue related to memory access. The code snippet is:

#include <vector>
using namespace std;

vector<int> arrx(M,-1); 
vector< vector<int> > arr(N,arrx);

...
for(int i=0;i<N;i++){
  for(int j=0;j<M;j++){
    //>>>>>>>>>>>> Part 1 <<<<<<<<<<<<<<
    // Simple arithmetic operations
    int n1 = 1 + 2;  // does not matter what (actually more complicated)
    // Integer assignment, without access to array
    int n2 = n1;

    //>>>>>>>>>>>> Part 2 <<<<<<<<<<<<<<
    // This turns out to be most expensive part
    arr[i][j] = n1;
  }
}

N and M are constants on the order of 1000-10000. When I compile this code (release build) it takes approximately 15 clocks to finish if Part 2 is commented out. With that part included the execution time goes up to 100+ clocks, so it is almost 10 times slower. I expected the assignment to be much cheaper than even simple arithmetic operations. This is actually true if we do not use arrays, but with the array the assignment turns out to be much more expensive. I also tried a 1-D array instead of the 2-D one: same result (the 2-D case is, as expected, somewhat slower). I also used int** or int* instead of vector< vector<int> > or vector<int>; again, the result is the same.

Why do I get such poor performance in array assignment and can I fix it?

Edit: One more observation. In part 2 of the code above, if we change the assignment from

arr[i][j] = n1;  // 172 clocks

to

n1 = arr[i][j];  // 16 clocks

the speed goes up (the measured clocks are in the comments). More interestingly, if we change the line:

arr[i][j] = n1; // 172 clocks

to

arr[i][j] = arr[i][j] * arr[i][j]; // 110 clocks

the speed is also higher than for the simple assignment. Is there a difference between reading from and writing to memory? Why do I get such strange performance?
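
For completeness, here is a minimal self-contained version of what I am timing (a sketch: it uses clock() rather than whatever produced the "clocks" numbers above, and the sizes are only illustrative):

#include <cstdio>
#include <ctime>
#include <vector>
using namespace std;

int main() {
    const int N = 5000, M = 5000;    // illustrative sizes, same order as above
    vector<int> arrx(M, -1);
    vector< vector<int> > arr(N, arrx);

    clock_t t0 = clock();
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < M; j++) {
            int n1 = 1 + 2;          // "part 1": cheap arithmetic (stand-in)
            arr[i][j] = n1;          // "part 2": the store that dominates the time
        }
    }
    clock_t t1 = clock();

    // Read something back so the stores cannot be treated as dead code.
    printf("arr[0][0] = %d, fill took %.0f ms\n", arr[0][0],
           1000.0 * (t1 - t0) / CLOCKS_PER_SEC);
    return 0;
}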

Thanks in advance!

Upvotes: 1

Views: 2308

Answers (5)

ev-br

Reputation: 26050

Let's make a simple estimate. A typical CPU clock speed nowadays is about 1-2 GHz (giga = 10^9). Simplifying a (really) great deal, that means a single processor cycle takes about 1 ns (nano = 10^-9). A simple arithmetic operation, like an int addition, takes several CPU cycles, of the order of ten.

Now, memory: a typical main-memory access time is about 50 ns (again, no need to go into the gory details just now, of which there are plenty).

You can see that even in the very best-case scenario, memory is slower than the CPU by a factor of about 5 to 10: roughly 50 ns per access versus 5-10 ns for an arithmetic operation.

In fact, I'm sweeping an enormous amount of detail under the rug, but I hope the basic idea is clear. If you're interested, there are books around (keywords: cache, cache misses, data locality, etc.); even the dated ones are still very good at explaining the general concepts.

Upvotes: 2

Johan Kotlinski

Reputation: 25739

Your assumptions are really wrong...

  1. Writing to main memory is much slower than doing a simple addition.
  2. If you don't do the write, it is likely that the loop will be optimized away entirely (one way to check this is sketched below).
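
One way to check point 2 (only a sketch; the function name and sizes are made up for illustration): give the per-iteration result a use the compiler must preserve, for example a running sum that gets printed, and then compare timings with and without the array write.

#include <cstdio>

// Sketch: sum the per-iteration result and print it, so "part 1"
// stays observable even though nothing is written to the array.
long long part1_only(int N, int M) {
    long long sum = 0;
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < M; j++) {
            int n1 = 1 + 2;   // stand-in for the real, more complicated arithmetic
            int n2 = n1;
            sum += n2;        // the result is consumed, so the loop body is not dead code
        }
    }
    return sum;
}

int main() {
    printf("%lld\n", part1_only(5000, 5000));  // print so the call is not removed either
    return 0;
}

Note that with a constant stand-in like 1 + 2 the optimizer can still collapse the whole loop into a single multiplication, so the comparison only says something once the real arithmetic is in place.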

Upvotes: 6

Donal Fellows

Reputation: 137597

Your nested arrays are going to occupy anywhere from a few MB up to several hundred MB of memory (10000 × 10000 four-byte ints is about 400 MB). It takes time to write that much memory, and no amount of clever hardware memory caching is going to help that much. Moreover, even a single memory write takes time, as it has to make its way over copper traces on a circuit board to a lump of silicon some distance away; physics wins.

But if you want to dig in more, try the cachegrind tool (assuming it's present on your platform). Just be aware that the code you're using above doesn't really permit a lot of optimization; its data access pattern doesn't have enormous amounts of reuse potential.
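
For example, a minimal program to feed to cachegrind could look like this (just a sketch; the file name, sizes, and the i + j value are only illustrative, and the valgrind/cg_annotate commands in the comments are the standard ones):

// Build: g++ -O2 -g fill.cpp -o fill
// Run:   valgrind --tool=cachegrind ./fill
// Then:  cg_annotate cachegrind.out.<pid>   (per-line counts need the -g above)
#include <cstdio>
#include <vector>
using namespace std;

int main() {
    const int N = 5000, M = 5000;                 // illustrative sizes
    vector< vector<int> > arr(N, vector<int>(M, -1));
    for (int i = 0; i < N; i++)
        for (int j = 0; j < M; j++)
            arr[i][j] = i + j;                    // the store whose misses we want to see
    printf("%d\n", arr[N - 1][M - 1]);            // read something back so the stores stay live
    return 0;
}

cg_annotate then shows which lines the cache misses are attributed to.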

Upvotes: 2

zwol

Reputation: 140659

You have discovered a sad truth about modern computers: memory access is really slow compared to arithmetic. There's nothing you can do about it. It is because electric fields in a copper wire only travel at about two-thirds of the speed of light.

Upvotes: 2

user149341

Reputation:

Unless your actual "part 1" is significantly more complex than your example would lead us to believe, there's no surprise here -- memory access is slow compared to basic arithmetic. Additionally, if you're compiling with optimizations, most or all of "part 1" may be getting optimized away because the results are never used.
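
A quick way to check whether that is happening (a sketch; the volatile sink is just one standard trick, and the sizes are illustrative; inspecting the assembly with g++ -O2 -S works too):

// Sketch: a volatile sink forces the store of n2 on every iteration,
// so "part 1" can be timed on its own, without the array write.
volatile int sink;

int main() {
    const int N = 5000, M = 5000;    // illustrative sizes
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < M; j++) {
            int n1 = 1 + 2;          // stand-in for the real arithmetic
            int n2 = n1;
            sink = n2;               // observable side effect: cannot be optimized away
        }
    }
    return 0;
}

Comparing the timing of this against the version that also writes into the array shows how much of the cost is really the memory traffic.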

Upvotes: 5
