Reputation:
I am learning x86 assembly from the book Programming Ground Up. When introducing functions, the author gives an example of a function that raises a given 4-byte integer to a power greater than 0. Here is how the function is defined (this is the version I wrote, but the code is almost identical):
1. .type find_pow, @function
2. find_pow:
3. pushl %ebp # Save current base pointer
4. movl %esp, %ebp # Copy stack pointer to base pointer
5. movl $1, %eax # %eax will hold the result, set it to 1
6.
7. subl $4, %esp
8. movl 8(%ebp), %ebx
10. movl 12(%ebp), %ecx
11.
12. movl %ebx, -4(%ebp)
13.
14. loop_find_pow: # Start loop
15. cmpl $1, %ecx # If 2nd parameter equals 1
16. je exit_find_now # Exit loop
17. movl -4(%ebp), %eax # Move current result to %eax
18. imull %ebx, %eax # Multiply %eax by the base in %ebx
19. movl %eax, -4(%ebp) # Store current result
20. decl %ecx # Decrease %ecx
21. jmp loop_find_pow # Jump to loop top
22.
23. exit_find_now:
24. movl -4(%ebp), %eax
25. movl %ebp, %esp
26. popl %ebp
27. ret
I understand the code completely, but I'm a little confused about why the author is doing what he's doing in lines 17...19. First, the value of the local variable is copied from the stack to %eax. Then the computation is done using %eax (I suppose this is for performance reasons?), and then the calculated value is stored back to the space allocated for the only local variable. I fail to understand why all this copying is required. The caller obtains the return value by examining the %eax register anyway. Doesn't it make sense to completely eliminate use of the local variable (at address -4(%ebp))?
Edit: If anyone wants to look at the original code in the book, browse to page 67.
Upvotes: 2
Views: 309
Reputation: 363980
Your version destroys the caller's ebx value. That register is call-preserved in the i386 System V ABI (which Linux uses). Pick a different tmp reg like EDX.
EAX, ECX, and EDX are call-clobbered in all the standard 32-bit calling conventions so those are the registers you can use for scratch space without saving/restoring.
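For example, the argument loads could become something like this (a sketch of just the affected lines, not the book's code):

        movl    8(%ebp), %edx      # base in EDX: call-clobbered, so nothing to save
        movl    12(%ebp), %ecx     # exponent in ECX, also call-clobbered
        # (or keep EBX, but bracket its use with pushl %ebx / popl %ebx to preserve the caller's value)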
Then the computation is done using %eax (I suppose this is for performance reasons?)

No, it's because imul doesn't have a memory-destination encoding, only r/m source.
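In other words, these forms encode fine (register destination, register or memory source), but there is no form with a memory destination:

        imull   %ebx, %eax         # ok: register source, register destination
        imull   -4(%ebp), %eax     # ok: memory source, register destination
      # imull   %eax, -4(%ebp)     # doesn't exist: imul can't write its result to memory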
If you cared about performance, you wouldn't introduce store/reload latency into the dependency chain for the product you're computing. Even keeping the loop counter in memory would be better (for large counts at least) because dec latency is shorter than imul latency.

In your loop, one input for each iteration comes from the previous iteration. The total latency of store/load/imul / store/load/imul / ... is what bottlenecks your loop on an out-of-order execution CPU (all current x86 CPUs are like that). Store-forwarding makes a store/reload only about 5 cycles, but that's still longer than imul latency on most CPUs.
If you're running low on registers and want to keep one of your variables in memory, generally prefer spilling a variable that is only read inside the loop, not written. Re-reading the same memory location multiple times is fine.

In compiler/optimization terminology, "spilling" a variable from a register is when you can't do the preferred thing of keeping it in a register, and instead have to store it to memory.
e.g. imul 8(%ebp), %eax as your loop body.

You also didn't need to copy to -4(%ebp); you could have just modified 8(%ebp), where you already had this variable in memory.
According to the calling convention you "own" your stack args and can modify that copy of your variables. So even if you did want to modify a value in memory, you still didn't need to copy it.
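A sketch of what that could look like here (label names are mine; it keeps your overall loop structure and relies on n >= 1): the running product lives in EAX, and only the read-only base gets re-read from its stack slot.

        movl    12(%ebp), %ecx     # n (assumed >= 1)
        movl    8(%ebp), %eax      # result starts as x, i.e. x^1
loop_find_pow:
        cmpl    $1, %ecx           # n-1 multiplies remain
        je      exit_find_pow
        imull   8(%ebp), %eax      # product stays in EAX; base re-read from the stack arg
        decl    %ecx
        jmp     loop_find_pow
exit_find_pow:                     # result is already in EAX, nothing to reload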
Your loop structure is also inefficient, although the 3-cycle latency of imul (or worse on some older CPUs) will easily hide the overhead of running all those extra instructions vs. a simple dec %ecx / jnz loop_find_pow at the bottom of your loop. Why are loops always compiled into "do...while" style (tail jump)?
Look at optimized compiler output for a C version of your function!
How to remove "noise" from GCC/clang assembly output?
(update: this could be simplified further because we're guaranteed that n>0. So we could just directly use a do{}while(--n); loop structure. I missed that in the question originally and will leave that simplification as an exercise for the reader.)
unsigned find_pow(unsigned x, unsigned n) {
    unsigned tot = x;    // or should this be 1? Possible off-by-1 error
    for(; n ; n--) {     // gcc fails to convert i=0 ; i<n ; i++ into a down-count
        tot *= x;
    }
    return tot;
}
GCC 9.1 -O3 -m32 -fno-tree-vectorize -march=skylake on Godbolt
find_pow:
        movl    4(%esp), %ecx
        movl    8(%esp), %eax
        movl    %ecx, %edx
        testl   %eax, %eax        # if(n==0) return
        je      .L1
.L3:                              # do{
        imull   %ecx, %edx
        decl    %eax
        jne     .L3               # }while(--n);
.L1:
        movl    %edx, %eax        # silly compiler, should have used tot=EAX in the first place
        ret
clang unrolls (unless you use -fno-unroll-loops). But it doesn't use multiple accumulators for the product to hide imul latency /facepalm. imul is pipelined with better throughput than latency on modern x86 CPUs, typically 3 cycle latency / 1 cycle throughput. See https://agner.org/optimize/.
But clang just repeats imul %ecx, %eax 8 times in its main loop, which is basically useless. Multiplying into %edx and %eax in parallel, and then multiplying those at the end, would let two chains of imul run in parallel. Use at least 3 registers for best performance with large n. But practical values of n are relatively small (like even 2**32 will overflow a 32-bit integer), so large-n optimizations are only interesting if you want to use this to get the low bits of (x**n) % (2**32).
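For example, a hand-written two-accumulator version might look like this (a sketch: the function name and local labels are mine, it assumes the same stack-args interface and n >= 1, and it only touches call-clobbered registers, re-reading the read-only base from its stack slot):

find_pow_2acc:                     # x at 8(%ebp), n at 12(%ebp), result in EAX
        pushl   %ebp
        movl    %esp, %ebp
        movl    8(%ebp), %eax      # accumulator A starts as x (covers x^1)
        movl    $1, %edx           # accumulator B starts as 1
        movl    12(%ebp), %ecx
        decl    %ecx               # remaining multiplies = n - 1  (n assumed >= 1)
        je      .Ldone             # n == 1: nothing left to do
        testl   $1, %ecx
        je      .Lpairs            # even count: go straight to the unrolled loop
        imull   8(%ebp), %eax      # odd count: peel one multiply into chain A
        decl    %ecx
        je      .Ldone
.Lpairs:                           # do {
        imull   8(%ebp), %eax      #   A *= x   (chain 1)
        imull   8(%ebp), %edx      #   B *= x   (chain 2, overlaps chain 1's latency)
        subl    $2, %ecx
        jne     .Lpairs            # } while (remaining != 0)
.Ldone:
        imull   %edx, %eax         # combine the partial products: A * B = x^n
        popl    %ebp
        ret

Two chains only roughly halve the latency bottleneck; as noted above, at least 3 accumulators would be needed to approach imul's throughput limit.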
I think that's a safe optimization even with wraparound at 2^32. GCC auto-vectorizes with AVX2 vpmulld if available (-march=skylake), which wouldn't be safe if integer multiply wasn't associative.
Clang does unroll with multiple accumulators when optimizing FP loops with -ffast-math (to make it treat FP add/mul as associative), so it's too bad it isn't doing that with integer math. Unrolling by 8 only makes sense if the compiler doesn't realize that n is probably usually small, so if clang is going to optimize for large n, do it right and hide imul latency.
When auto-vectorizing with AVX2, clang does get this somewhat right and uses 4 vectors of totals in parallel. But vpmulld has 10 cycle latency on Haswell/Skylake and clang unrolls by 32, not just 4. Seems like way too much.
Upvotes: 2
Reputation: 137398
Doesn't it make sense to completely eliminate use of the local variable (at address -4(%ebp))?
Yes.
Because eax is a volatile register (defined by the ABI), the function need not preserve its original value.
An algorithm this simple can keep all of its state in registers, without having to spill them onto the stack.
However, this is a simple example intended to teach; one of the things it demonstrates is how to use stack variables when you don't have enough registers.
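For instance, a sketch of the whole function with everything in registers (same 8(%ebp)/12(%ebp) interface, assuming n >= 1 as the problem statement guarantees):

find_pow:
        pushl   %ebp
        movl    %esp, %ebp
        movl    8(%ebp), %edx      # base in EDX (call-clobbered, unlike EBX)
        movl    12(%ebp), %ecx     # exponent in ECX
        movl    %edx, %eax         # result starts as x^1; all state is now in registers
loop_find_pow:
        cmpl    $1, %ecx
        je      exit_find_pow
        imull   %edx, %eax         # result *= x; no stores or reloads anywhere
        decl    %ecx
        jmp     loop_find_pow
exit_find_pow:
        popl    %ebp               # ESP never moved, so there is nothing to restore first
        ret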
Upvotes: 5