Reputation: 211
I have heard that counting down is faster than counting up in loops. However, I also heard that accessing memory forwards (in ascending memory address order) is faster than accessing memory in reverse. For example, if I have a loop that repeats a line of code N times, counting down might be slightly faster. But if I access an array with the current value of N, and I count down, I would be accessing the memory in a backwards fashion. Would this be slower and possibly negate all performance benefits of counting down in the first place?
Would this:
short array[1024];
for (int i = 0; i < 1024; ++i) {
do_something_with(array[i]);
}
Be faster than this:
short array[1024];
for (int i = 1024; i--;) {
do_something_with(array[i]);
}
I am trying to write the fastest code possible on modern machines.
Upvotes: -1
Views: 240
Reputation: 43327
The latter may be faster on some processors, if you can still measure it. The reason is no surprise to anyone who knows assembly: on some processors, the result of the decrement can be compared with zero for free, while comparing with 1024 costs an extra instruction.
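For illustration, here is a down-counting shape that maps directly onto such a decrement-and-branch pair; a sketch only, since whether it wins anything depends on the processor and on what the compiler does with it:
extern void do_something_with(short value);

void process_backward(short array[1024]) {
    unsigned i = 1024;
    do {
        do_something_with(array[i - 1]); /* visits indices 1023 down to 0 */
    } while (--i); /* the decrement itself sets the zero flag, so no separate compare against 1024 is needed */
}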
The access direction simply does not matter to the memory hardware; RAM has no preference for increasing over decreasing address sequences. However, processor features such as preloading memory may have effects that depend on access direction. (With memory-mapped files you might once have been able to observe that increasing access was normally faster than decreasing access on rotating-head disks; but even that should not exist on SSDs.)
Note the “if you can still measure it”: such differences are getting hard to measure these days, and mostly nobody cares whether you can squeeze that tiny bit of performance out of anything but the hottest loops.
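If you want to try to measure it anyway, a minimal benchmark sketch along these lines is a starting point. It assumes a POSIX system for clock_gettime; the single pass per direction and the trivial summing body are simplifications (the forward pass also pays the cost of faulting the pages in), so a serious benchmark would warm up and repeat both loops:
#include <stdio.h>
#include <time.h>

#define N (1 << 24) /* 32 MiB of shorts, larger than a typical L3 cache */
static short array[N];

static double seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void) {
    volatile long sink = 0; /* keeps the loops from being optimized away */
    double t0 = seconds();
    for (long i = 0; i < N; ++i)
        sink += array[i]; /* forward pass */
    double t1 = seconds();
    for (long i = N; i--;)
        sink += array[i]; /* backward pass */
    double t2 = seconds();
    printf("forward:  %f s\nbackward: %f s\n", t1 - t0, t2 - t1);
    return 0;
}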
Upvotes: -1
Reputation: 223795
But if I access an array with the current value of N, and I count down, I would be accessing the memory in a backwards fashion. Would this be slower and possibly negate all performance benefits of counting down in the first place?
This is going to take some explaining, so I will motivate it with a conclusion first: Whether forward or backward memory accesses are likely to differ in time depends primarily on whether they match processor lookahead features and whether those features are symmetric in direction.
In regard to access direction, the primary driver of memory access time is processor preloading features. Memory hardware generally has no difference in time to access an address that is less than or greater than some other recently accessed address. But processors (generally, not all) do look at address access patterns and attempt to preload memory that is likely to be used soon.
For example, if a process accesses cache lines 13, 14, and 15, the processor may request cache line 16 from memory before the process actually executes any load instruction for it. This gets complicated in several ways.
First, if the process is racing through memory, working on data as fast as it can be read from memory, there may be no advantage to processor preloading, because the process is requesting the next memory as soon as the processor would try it anyway. This is especially so on superscalar processors that have multiple instructions “in flight” at one time, since that allows “future” load instructions to request memory while prior data-processing instructions are still working on the data from prior loads. So processor preloading works best on algorithms that do at least a bit more data processing relative to their memory use.
Second, most algorithms do not simply use data in a stream, either forward or backward. The process may be using data from cache lines 13, 14, 15, and so on but also referencing stack locations, and it may be using data from cache lines 13, 14, and 15 but also using data from cache lines 79, 80, 81, and so on. So processor designers may try to recognize consecutive accesses even when they are interrupted by other accesses. And then the processor behavior gets complicated to predict. How many “other” accesses will your process have in between its consecutive accesses? Will that be too many for the processor to deal with?
Third, a regular stream of accesses might not look regular from some points of view. Suppose you have an array of elements with four elements per cache line, and you access a regular sequence of elements: 0, 5, 10, 15, 20, 25, 30, 35, 40,… These are in cache lines 0 (elements 0-3, containing 0), 1 (4-7, containing 5), 2 (8-11), 3 (12-15), 5 (20-23), 6 (24-27), 7 (28-31), 8 (32-35), and 10 (40-43). If the processor is monitoring which cache lines are accessed, not which addresses are accessed, it sees an irregular sequence: 0, 1, 2, 3, 5, 6, 7, 8, 10, which is missing 4 and 9. When the processor saw 0, 1, 2, and 3, it may have preloaded cache line 4, and that wasted time and cache space because 4 is not used. Or, since 4 is not used by the program, the processor might not preload 5.
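(The cache-line arithmetic in that example is easy to check with a few lines of C; the four-elements-per-line figure is just the assumption from the example above:)
#include <stdio.h>

int main(void) {
    for (int element = 0; element <= 40; element += 5)
        printf("element %2d -> cache line %d\n", element, element / 4);
    return 0;
}
This prints cache lines 0, 1, 2, 3, 5, 6, 7, 8, 10, skipping 4 and 9: exactly the irregular sequence described above.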
What this means is that your memory access time is likely to be affected by your data access pattern (what memory it actually accesses, not particularly how the loop is written) and the processor’s lookahead features. And, in answer to your question, whether forward or backward memory accesses are likely to differ in time depends primarily on whether they match processor lookahead features and whether those features are symmetric in direction.
If the processor design recognizes the same (reflected) backward sequences as it does forward sequences, then accessing memory forward or backward should not matter. If the processor design recognizes some forward sequences but not corresponding backward sequences, then code that matches such a forward sequence may perform better than code that matches such a backward sequence.
That said, I am unable to comment on how common it is for processors to have asymmetric lookahead features. There are too many processor models in existence. I have some recollection of working with an Intel processor in which reverse lookahead was a new feature, but I expect that is common now. But whether all the recognized access patterns are symmetric or what portion of processors on the market have symmetric lookahead features, I cannot say.
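(As an aside to the preloading discussion: besides hardware lookahead, compilers such as GCC and Clang expose explicit prefetching through the __builtin_prefetch extension. The sketch below is illustrative only; PREFETCH_AHEAD is a hypothetical tuning constant that would have to be found by measurement, and explicit prefetching often does not beat the hardware:)
extern void do_something_with(short value);

#define PREFETCH_AHEAD 16 /* hypothetical lookahead distance, in elements */

void process_with_prefetch(const short *array, int n) {
    for (int i = 0; i < n; ++i) {
        if (i + PREFETCH_AHEAD < n)
            __builtin_prefetch(&array[i + PREFETCH_AHEAD]); /* request data before it is needed */
        do_something_with(array[i]);
    }
}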
Upvotes: 0
Reputation: 214770
It doesn't make much sense to speak of optimizations without a specific system in mind. If your definition of "modern" is something high-end with branch prediction and cache, then that's one use case. Another modern use case could be an ARM Cortex-M0 to M3, which has no such features.
The old "you should always count down" trick dates back some 30 years from an era where compilers were horrible at optimizing code. It's based on the fact that on many systems, compare vs zero is a few clock ticks faster than comparing vs a value. I'd like to confidently say that compilers are smart enough to do that optimization for you nowadays, but I wouldn't assume as much before verifying it.
Benchmarking your two versions with the latest gcc for x86_64 gives nearly identical code: https://godbolt.org/z/E8Tb9hWKx. One uses add, one uses sub, but the same cmp instruction is used either way, so counting down gave no benefit - it only made the C code obscure. Supposedly up-counting could be more beneficial in terms of data cache usage, but I wouldn't assume that it matters here.
However, if we switch the godbolt target to something more archaic like the old AVR, then both versions get optimized to use a subi for whatever reason (it still has to compare i in 2 steps since i is 16 bits): https://godbolt.org/z/K7c7nrMo6. This is an old, slow 8-bitter, so picking the right instruction matters even more here, and there's no such thing as cache or branch prediction. Obviously the former version is therefore superior (for AVR), since it is the most readable. So the 30-year-old trick didn't even matter on a 30-year-old CPU, given that we pick a modern compiler.
Some rules of thumb:
- Don't reason about optimizations without a specific target system in mind.
- Don't apply decades-old tricks blindly; check what your compiler actually generates (for example on godbolt) and benchmark.
- When the generated code is the same either way, prefer the most readable version.
Upvotes: 5
Reputation: 154174
I am trying to write the fastest code possible on modern machines.
Use your time and talents wisely.
At least 97% of the time, concerns like this are a waste of time. Code for clarity. Up or down: which one expresses the higher-level code's intent best?
Review "Is premature optimization really the root of all evil?".
When the order of complexity is the same, it is best to focus on larger-picture concerns and let the compiler handle the small stuff.
Upvotes: 3