Reputation: 661
Yes, this is a homework question; I just need a push in the right direction.
Which of these two C++ code blocks is faster, and why? I think it is the top one, because the array is traversed in order of i, or am I wrong here?
double A[100][100];
...
for (int i = 0; i < 100; i++) {
    for (int j = 0; j < 100; j++) {
        A[i][j] = i * j;
    }
}
double A[100][100];
...
for (int j = 0; j < 100; j++) {
    for (int i = 0; i < 100; i++) {
        A[i][j] = i * j;
    }
}
Upvotes: 1
Views: 274
Reputation: 41374
The most general answer is: you need to profile both blocks and measure the result empirically.
However, I can give you an answer for the majority of modern x86, x64, PPC, and ARM processors with hierarchical caches. On these platforms the top one will be faster because of better data locality: it accesses memory addresses sequentially, so you will hit the data cache more often. Smart x86 and x64 implementations will even notice that you are reading memory sequentially in this fashion, and prefetch the next cache line before you need it. The bottom pattern accesses memory non-sequentially, jumping between distant addresses, so you are likely to miss the cache on every read.
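If you want to measure it yourself, a minimal timing sketch might look like the following. The function names and the 10,000-repetition count are just illustrative; compile with optimizations on (e.g. -O2), and be aware that a clever compiler may interchange the loops and erase the difference.

#include <chrono>
#include <cstdio>

double A[100][100];

void topVersion() {                 // i outer, j inner: sequential sweep
    for (int i = 0; i < 100; i++)
        for (int j = 0; j < 100; j++)
            A[i][j] = i * j;
}

void bottomVersion() {              // j outer, i inner: strided sweep
    for (int j = 0; j < 100; j++)
        for (int i = 0; i < 100; i++)
            A[i][j] = i * j;
}

template <typename F>
long long timeIt(F f) {
    auto start = std::chrono::steady_clock::now();
    for (int rep = 0; rep < 10000; rep++)   // repeat so the time is measurable
        f();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
}

int main() {
    std::printf("top:    %lld us\n", timeIt(topVersion));
    std::printf("bottom: %lld us\n", timeIt(bottomVersion));
}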
Ulrich Drepper has a nice paper about this, "What Every Programmer Should Know About Memory". One of his examples in that paper demonstrates exactly how these two blocks of code differ.
As an example of the math here, assume arguendo that you are programming an Intel Core i7 with a 64-byte cache line size and a 32 KB L1 data cache. That means that every time you fetch an address, the processor will also fetch all the other data in that 64-byte-aligned block. On that platform a double is eight bytes, so eight of them fit per cache line. Thus the top example will, on average, miss on one out of every eight iterations: after each miss, the next 56 bytes are fetched along with it, so the next seven double reads will hit in cache.
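Spelled out in code, that arithmetic looks like this. The 64-byte line size is an assumption about the target CPU, not something C++ guarantees, and it assumes an 8-byte double:

#include <cstddef>

constexpr std::size_t cacheLineBytes = 64;   // assumed line size for this example
constexpr std::size_t doublesPerLine = cacheLineBytes / sizeof(double);   // 64 / 8 = 8
static_assert(doublesPerLine == 8, "eight doubles per 64-byte line");
// Sequential access therefore misses once per eight iterations,
// a miss rate of 1/8 = 12.5%.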
The bottom example could probably fit all 100 lines of data (one for each value of i) into the cache simultaneously: 100 * 64 = 6400 bytes, well within the 32 KB cache size. But it is also likely that you will exceed the cache's associativity, meaning that two of those lines map to the same set of the L1, so one will evict the other.
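The difference in stride is easy to check at compile time. This is a small sketch; byteOffset is a hypothetical helper that mirrors the row-major layout of double A[100][100], again assuming an 8-byte double:

#include <cstddef>

// Byte offset of A[i][j] within double A[100][100], which C++ lays out
// row by row (row-major order).
constexpr std::size_t byteOffset(int i, int j) {
    return (static_cast<std::size_t>(i) * 100 + j) * sizeof(double);
}

// Top block: the inner loop advances j, so consecutive writes are 8 bytes apart.
static_assert(byteOffset(0, 1) - byteOffset(0, 0) == 8, "adjacent elements");
// Bottom block: the inner loop advances i, so consecutive writes are a whole
// row (800 bytes) apart -- a different cache line every time.
static_assert(byteOffset(1, 0) - byteOffset(0, 0) == 800, "one full row apart");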
Upvotes: 6
Reputation:
There is no way to know which piece of code is faster without running and profiling your code.
We could guess about how locality and cache behaviour would influence that timing (and your guess is a good one), but guesses aren't a substitute for profiling. (See: How can I profile C++ code running in Linux?)
One reason why the first version could be faster: it accesses the array elements in the same order they are laid out in memory (row-major), so it gets better cache locality, as the other answer explains.
Why there may be no difference: an optimizing compiler may interchange the two loops, or the whole array (100 * 100 * 8 = 80,000 bytes) may fit comfortably in cache, making the traversal order irrelevant.
I can't think of any reason why the second would be faster, but I've been surprised before.
Upvotes: 8