Reputation: 5291
I am trying to compile the following code sample (NASM syntax) from this article on x86 assembly floating point:
;; c^2 = a^2 + b^2 - cos(C)*2*a*b
;; C is stored in ang
global _start
section .data
a: dq 4.56 ;length of side a
b: dq 7.89 ;length of side b
ang: dq 1.5 ;opposite angle to side c (around 85.94 degrees)
section .bss
c: resq 1 ;the result ‒ length of side c
section .text
_start:
fld qword [a] ;load a into st0
fmul st0, st0 ;st0 = a * a = a^2
fld qword [b] ;load b into st1
fmul st1, st1 ;st1 = b * b = b^2
fadd st1, st0 ;st1 = a^2 + b^2
fld qword [ang] ;load angle into st0
fcos ;st0 = cos(ang)
fmul qword [a] ;st0 = cos(ang) * a
fmul qword [b] ;st0 = cos(ang) * a * b
fadd st0, st0 ;st0 = cos(ang) * a * b + cos(ang) * a * b = 2(cos(ang) * a * b)
fsubp st1, st0 ;st1 = st1 - st0 = (a^2 + b^2) - (2 * a * b * cos(ang))
;and pop st0
fsqrt ;take square root of st0 = c
fst qword [c] ;store st0 in c ‒ and we're done!
When I execute the following command:
nasm -f elf32 cosineSample.s -o cosineSample.o
I get the following error for the line fmul st1, st1:
error: invalid combination of opcode and operands
What do I need to do to resolve this? Do I need to pass special arguments to nasm? Is the code sample wrong?
Upvotes: 1
Views: 927
Reputation: 365882
I fixed the code on Wikibooks and added some extra comments (Jester's answer is good), so now it assembles and runs correctly (tested with GDB, single-stepping with layout reg / tui reg float). This is the diff between revisions. The revision that introduced the fmul st1,st1 invalid-instruction bug is here, but even before that it failed to clear the x87 stack when it was done.
Just for fun, I wanted to write a more efficient version that only loads a and b once, and which allows more instruction-level parallelism by doing everything not involving the cos result first: prepare 2*a*b before multiplying that by cos(ang), so those calculations can both run in parallel. Assuming fcos is the critical path, my version only has one fmul and one fsubp of latency from the fcos result to the fsqrt input.
default rel ; in case we assemble this in 64-bit mode, use RIP-relative addressing
... declare stuff, omitted.
fld qword [a] ;load a into st0
fld st0 ; st1 = a because we'll need it again later.
fmul st0, st0 ;st0 = a * a = a^2
fld qword [b] ;load b into st0 (pushing the a^2 result up to st1)
fmul st2, st0 ; st2 = a*b
fmul st0, st0 ;st0 = b^2, st1 = a^2, st2 = a*b
faddp ;st0 = a^2 + b^2 st1 = a*b; st2 empty
fxch st1 ;st0 = a*b st1 = a^2 + b^2 ; could avoid this, but only by using cos(ang) earlier, worse for critical path latency
fadd st0,st0 ;st0 = 2*a*b st1 = a^2 + b^2
fld qword [ang]
fcos ;st0 = cos(ang) st1 = 2*a*b st2 = a^2+b^2
fmulp ;st0=cos(ang)*2*a*b st1 = a^2+b^2
fsubp st1, st0 ;st0 = (a^2 + b^2) - (2 * a * b * cos(ang))
fsqrt ;take square root of st0 = c
fstp qword [c] ;store st0 in c and pop, leaving the x87 stack empty again ‒ and we're done!
Of course, x87 is pretty much obsolete. On modern x86, normally you'd use SSE2 scalar (or packed!) for anything floating-point.
x87 has 2 things going for it on modern x86: 80-bit precision in hardware (vs. 64-bit double), and it's good for small code-size (machine-code bytes, not number of instructions or source size). Good instruction caches usually mean that code-size is not an important enough factor to make x87 worth it for FP performance, though: it's generally slower than SSE2 because of the extra instructions needed to deal with the clunky x87 stack.
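For example, here's a minimal sketch of that 80-bit advantage (tmp80 is a hypothetical 10-byte buffer, e.g. tmp80: rest 1 in .bss):
fld  qword [a]       ; a is converted to 80-bit extended on load
fmul qword [b]       ; the product keeps a full 64-bit significand in st0
fstp tword [tmp80]   ; store the whole 80-bit temporary (tmp80 is a hypothetical label)
; the SSE2 equivalent, mulsd xmm0, [b], rounds the product to 64-bit double right away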
And for beginners, or for code-size reasons, x87 has transcendental functions like fcos and fsin, and log/exp, built in as single instructions. They're microcoded with many uops, and probably not faster than a scalar library function, but on some CPUs you might be ok with the speed / precision tradeoff they make, and the absolute speed. At least if you're using x87 in the first place; otherwise you have to bounce the results to/from XMM registers with a store/reload.
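That bounce is just a store/reload through memory; a minimal sketch, assuming 32-bit code like the question's, with the angle arriving in xmm0:
sub   esp, 8
movsd [esp], xmm0    ; spill the angle from XMM to memory
fld   qword [esp]    ; reload it onto the x87 stack
fcos                 ; st0 = cos(angle); only valid for |angle| < 2^63
fstp  qword [esp]    ; round back to double and store
movsd xmm0, [esp]    ; result back in an XMM register
add   esp, 8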
Range-reduction for sin/cos doesn't do any extended-precision stuff to avoid huge relative errors very close to a multiple of Pi, just using the internal 80-bit (64-bit significand) value of Pi. (A library implementation might or might not do that, depending on desired speed vs. precision tradeoff.) See Intel Underestimates Error Bounds by 1.3 quintillion.
(And of course x87 in 32-bit code gives you compatibility with Pentium III and other CPUs that didn't have SSE2 for double, only SSE1 for float or no XMM registers at all. x86-64 has SSE2 as baseline so this advantage doesn't exist on x86-64.)
For beginners, the huge downside to x87 is keeping track of the x87 stack registers, and not letting stuff accumulate. You can easily end up with code that works once, but then gives NaN when you put it in a loop because you didn't balance your x87 stack ops.
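A hedged sketch of that failure mode (hypothetical loop, count assumed in ecx):
loop_top:
    fld  qword [a]   ; pushes a new value every iteration...
    fmul st0, st0
    fst  qword [c]   ; fst doesn't pop; this should be fstp
    dec  ecx
    jnz  loop_top    ; after 8 pushes the stack is full: the next fld
                     ; signals a stack fault and loads an indefinite NaN instead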
extern cos
global cosine_law_sse2_scalar
cosine_law_sse2_scalar:
movsd xmm0, [ang]
call cos ; xmm0 = cos(ang). Avoid using this right away so OoO exec can do the rest of the work in parallel
movsd xmm1, [a]
movsd xmm2, [b]
movaps xmm3, xmm1 ; copying registers should always copy the full reg, not movsd merging into the old value.
mulsd xmm3, xmm2 ; xmm3 = a*b
mulsd xmm1, xmm1 ; a^2
mulsd xmm2, xmm2 ; b^2
addsd xmm3, xmm3 ; 2*a*b
addsd xmm1, xmm2 ; a^2 + b^2
mulsd xmm3, xmm0 ; 2*a*b*cos(ang)
subsd xmm1, xmm3 ; (a^2 + b^2) - 2*a*b*cos(ang)
sqrtsd xmm0, xmm1 ; sqrt(that), in xmm0 as a return value
ret
;; This has the work interleaved more than necessary for most CPUs to find the parallelism
This version only has 11 uops after the call cos returns (https://agner.org/optimize/). It's pretty compact, and pretty simple. No keeping track of the x87 stack. And it has the same nice dependency chains as x87, not using the cos result until we already have 2*a*b.
We can even play around with loading a and b together, as one 128-bit vector. But then unpacking it to do different things with the two halves, or to get b^2 from the top element as a scalar, is clumsy. If SSE3 haddpd was only 1 uop that would be great (and would let us do a*b + a*b and a^2 + b^2 with one instruction, given the right inputs), but on all CPUs that have it, it's 3 uops.
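To make that concrete, a sketch of how the 1-uop-haddpd version could have looked (not used below; the register contents are assumptions):
; assuming xmm1 = {a*b, a*b} and xmm2 = {a^2, b^2} (low element listed first)
haddpd xmm1, xmm1    ; low qword = a*b + a*b = 2*a*b
haddpd xmm2, xmm2    ; low qword = a^2 + b^2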
(PS vs. PD only matters for actual math instructions like MULSS/SD. For FP shuffles and register copies, just use whichever FP instruction gets your data where you want it, with a preference for PS/SS because they have shorter encodings in machine code. That's why I used movaps; movapd is always a missed optimization wasting 1 byte, unless you're making instructions longer on purpose for alignment.)
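A quick illustration of the wasted byte (standard encodings for the register-register forms):
movaps xmm3, xmm1    ; 0F 28 D9     = 3 bytes
movapd xmm3, xmm1    ; 66 0F 28 D9  = 4 bytes, same data movement plus a useless 66 prefix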
;; I didn't actually end up using SSE3 for movddup or haddpd, it turned out I couldn't save uops that way.
global cosine_law_sse3_less_shuffle
cosine_law_sse3_less_shuffle:
;; 10 uops after the call cos, if both extract_high_half operations use pshufd or let movhlps have a false dependency
;; or if we had AVX for vunpckhpd xmm3, xmm1,xmm1
;; and those 10 are a mix of shuffle and MUL/ADD.
movsd xmm0, [ang]
call cos ; xmm0 = cos(ang). Avoid using this right away so OoO exec can do the rest of the work in parallel
movups xmm1, [a] ; {a, b} (they were in contiguous memory in this order. low element = a)
movaps xmm3, xmm1
; xorps xmm3, xmm3 ; break false dependency by zeroing. (xorps+movhlps is maybe better than movaps + unpckhpd, at least on SnB but maybe not Bulldozer / Ryzen)
; movhlps xmm3, xmm1 ; xmm3 = b
; pshufd xmm3, xmm1, 0b01001110 ; xmm3 = {b, a} ; bypass delay on Nehalem, but fine on most others
mulsd xmm3, [b] ; xmm3 = a*b ; reloading b is maybe cheaper than shuffling it out of the high half of xmm1
addsd xmm3, xmm3 ; 2*b*a
mulsd xmm3, xmm0 ; 2*b*a*cos(ang)
mulpd xmm1, xmm1 ; {a^2, b^2}
;xorps xmm2, xmm2 ; we don't want to just use xmm0 here; that would couple this dependency chain to the slow cos(ang) critical path sooner.
movhlps xmm2, xmm1
addsd xmm1, xmm2 ; a^2 + b^2
subsd xmm1, xmm3 ; (a^2 + b^2) - 2*a*b*cos(ang)
sqrtsd xmm0, xmm1 ; sqrt(that), in xmm0 as a return value
ret
And we can do even better with AVX, saving a MOVAPS register copy, because the 3-operand non-destructive VEX versions of instructions let us put the result in a new register without destroying either of the inputs. This is really great for FP shuffles, because SSE* doesn't have any copy-and-shuffle for FP operands, only pshufd, which can cause extra bypass latency on some CPUs. So it saves the MOVAPS, and the (commented-out) XORPS that breaks a dependency on whatever produced the old value of XMM2 for MOVHLPS. (MOVHLPS replaces the low 64 bits of the destination with the high 64 of the src, so it has an input dependency on both registers.)
global cosine_law_avx
cosine_law_avx:
;; 9 uops after the call cos. Reloading [b] is good here instead of shuffling it, saving total uops / instructions
vmovsd xmm0, [ang]
call cos ; xmm0 = cos(ang). Avoid using this right away so OoO exec can do the rest of the work in parallel
vmovups xmm1, [a] ; {a, b} (they were in contiguous memory in this order. low element = a)
vmulsd xmm3, xmm1, [b] ; xmm3 = a*b
vaddsd xmm3, xmm3 ; 2*b*a. (really vaddsd xmm3,xmm3,xmm3 but NASM lets us shorten when dst=src1)
vmulsd xmm3, xmm0 ; 2*b*a*cos(ang)
vmulpd xmm1, xmm1 ; {a^2, b^2}
vunpckhpd xmm2, xmm1,xmm1 ; xmm2 = { b^2, b^2 }
vaddsd xmm1, xmm2 ; a^2 + b^2
vsubsd xmm1, xmm3 ; (a^2 + b^2) - 2*a*b*cos(ang)
vsqrtsd xmm0, xmm1,xmm1 ; sqrt(that), in xmm0 as a return value. (Avoiding an output dependency on xmm0, even though it was an ancestor in the dep chain. Maybe lets the CPU free that physical reg sooner)
ret
I only tested the first x87 version, so I may have missed a step in one of the others.
Upvotes: 1
Reputation: 58822
That code is broken, unfortunately. fmul cannot operate on st1, st1, but even if it could, it wouldn't do what the author wanted. As per the comment, he wanted to calculate b*b, but b is in st0 at that point. The comment load b into st1 is wrong; fld always loads into st0 (the top of the stack). You need to change the fmul st1, st1 to fmul st0, st0. Furthermore, to get the correct result, the following fadd st1, st0 has to be reversed as well. The code also leaves the FPU stack dirty.
Also note that the program has no ending, so it will segfault unless you add an explicit exit system call.
Here is the fixed code, converted to GNU assembler syntax:
.intel_syntax noprefix
.global _start
.data
a: .double 4.56 # length of side a
b: .double 7.89 # length of side b
ang: .double 1.5 # opposite angle to side c (around 85.94 degrees)
.lcomm c, 8
.text
_start:
fld qword ptr [a] # load a into st0
fmul st # st0 = a * a = a^2
fld qword ptr [b] # load b into st0
fmul st # st0 = b * b = b^2
faddp # st0 = a^2 + b^2
fld qword ptr [ang] # load angle into st0
fcos # st0 = cos(ang)
fmul qword ptr [a] # st0 = cos(ang) * a
fmul qword ptr [b] # st0 = cos(ang) * a * b
fadd st # st0 = cos(ang) * a * b + cos(ang) * a * b = 2(cos(ang) * a * b)
fsubp # st1 = st1 - st0 = (a^2 + b^2) - (2 * a * b * cos(ang))
# and pop st0
fsqrt # take square root of st0 = c
fstp qword ptr [c] # store st0 in c - and we're done!
# end program
mov eax, 1
xor ebx, ebx
int 0x80
Upvotes: 3