Reputation:
For example:
; Method 1
.data
val1 DWORD 10000h
.code
add eax,val1
vs.:
; Method 2
.code
add eax,10000h
Which method would execute faster after being compiled (assembled)? I'm thinking Method 2 would produce faster code, because the CPU won't have to read the value from main memory before adding it to the eax register. I'm not sure about my reasoning; could somebody help?
Upvotes: 4
Views: 1516
Reputation: 471299
In all likelihood, it will be situation dependent and the difference may not even be noticeable.
Factors such as out-of-order execution will likely hide any sort of inherent "slowness" of either version unless there actually is a bottleneck.
That said, if we had to pick which is faster, then you are correct that the second case is likely to be faster.
If we look at Agner Fog's tables for all the current x86 processors:
Core 2:
    add/sub r, r/i   Latency = 1        1/Throughput = 0.33
    add/sub r, m     Latency = unknown  1/Throughput = 1
Nehalem:
    add/sub r, r/i   Latency = 1        1/Throughput = 0.33
    add/sub r, m     Latency = unknown  1/Throughput = 1
Sandy Bridge:
    add/sub r, r/i   Latency = 1        1/Throughput = 0.33
    add/sub r, m     Latency = unknown  1/Throughput = 0.5
K10:
    add/sub r, r/i   Latency = 1        1/Throughput = 0.33
    add/sub r, m     Latency = unknown  1/Throughput = 0.5
In all cases, the memory-operand version has lower throughput. The latency is listed as unknown in all cases, but it is almost certainly more than 1 cycle. So the memory-operand version is worse on every count.
The memory-operand version uses all the same execution ports as the immediate version, plus a memory read port. That can only make the situation worse. In fact, this is why the throughputs are lower with the memory operand: the memory ports can only sustain 1 or 2 reads per cycle, whereas the adders can sustain a full 3 per cycle.
Furthermore, this assumes that the data is in the L1 cache. If it isn't, the memory-operand version will be MUCH slower.
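As a rough sanity check, you can time both forms yourself with a micro-benchmark along the lines of the sketch below. The label, register choices, and iteration count are mine, and loop overhead plus out-of-order execution will blur the absolute numbers; note also that chaining every add into a single register would measure latency rather than throughput, which is why the adds go to three independent registers.
; rough micro-benchmark sketch (clobbers ebx/esi/edi - fine for a standalone test)
rdtsc                  ; read the time-stamp counter into edx:eax
mov esi,eax            ; keep the starting low 32 bits
mov ecx,100000000      ; arbitrary iteration count
again:
add eax,val1           ; instruction under test (swap in add eax,10000h to compare)
add ebx,val1           ; independent destination, so the adds can overlap
add edi,val1           ; a third independent chain
dec ecx
jnz again
rdtsc                  ; read the counter again
sub eax,esi            ; approximate elapsed cycles in eax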
Taking this one step further, we can examine the size of the encoded instructions:
add eax,val1 -> 03 05 14 00 00 00
add eax,10000h -> 05 00 00 01 00
The encoding for the first one may be slightly different depending on the address of val1. The examples I've shown here are from my particular test case.
So the memory-access version needs an extra byte to encode, which means slightly larger code size and potentially more i-cache misses at the extreme.
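If you want to check the encodings for your own build, one option (assuming you assemble with MASM's ml.exe; test.asm is a placeholder name) is to request a listing file:
ml /c /Fl test.asm
The generated test.lst prints the emitted bytes next to each source line, so you can see how the actual address of val1 affects the encoding.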
So in conclusion, if there is a performance difference between the versions, it is likely that the immediate version will be faster because:
- it has equal or better latency and throughput on all the processors listed above;
- it does not need a memory read port and cannot stall on a cache miss;
- it encodes one byte smaller, so it puts slightly less pressure on the instruction cache.
Upvotes: 5
Reputation: 726669
10000h will be read from memory no matter what: either from its location in data memory, or from its location in instruction memory. For smaller constant values, CPUs provide special instructions that do not require additional space for the value being added, but this depends on the specific architecture. The add-immediate will probably be faster because of caching: by the time the instruction is decoded, the constant is already in cache, and the addition will be very quick.
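On x86, for instance, this takes the form of a shorter encoding rather than a separate instruction: a constant that fits in a signed byte can use the sign-extended imm8 form. A quick illustration (the byte sequences in the comments are typical assembler output):
add eax,10h            ; 83 C0 10 - 3 bytes, constant fits in a sign-extended imm8
add eax,10000h         ; 05 00 00 01 00 - 5 bytes, needs a full 32-bit immediate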
Small off-topic note: your example shows a case where an optimizing C compiler could produce faster code than hand-written assembly: instead of adding 10000h, the optimizer may increment the upper half-word by one and leave the lower half-word as is.
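In assembly terms, that transformation would look something like the following for a 32-bit variable in memory. This is a sketch only: the two forms set the flags differently, so a compiler could only substitute one for the other when the flags are not used afterwards.
add DWORD PTR val1,10000h  ; adds 10000h to the whole 32-bit value
inc WORD PTR val1+2        ; same result: the low word cannot carry, so just bump the upper word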
Upvotes: 5
Reputation: 330
I haven't done assembly in a while, but I believe this code is not equivalent.
In method 1, you add the address of val1 to eax; in method 2, you add the constant value 10000h to eax. To add the contents of the variable, you would have to do
add eax,[val1]
and this would be slower because it would trigger a memory read. And this code may not even be legal. Shouldn't you do something like:
mov ecx, val1
add eax, [ecx]
As I said, my Intel assembly is pretty rusty :)
Upvotes: 0
Reputation: 5101
Adding an immediate (your magical hex value) is indeed faster (on the architectures I'm aware of, at least).
I think the question is how much. I reckon this depends on whether val1 is cached or not.
If it's NOT cached, it's VERY slow, since accessing main memory is far slower than accessing the cache (at any level; L1 is the fastest, indeed).
If it IS indeed cached, the results are, in my humble opinion, pretty close to each other.
Upvotes: 3