Jet Blue

Reputation: 5291

NASM floating point - invalid combination of opcode and operands

I am trying to assemble the following code sample (NASM syntax) from this article on x86 assembly floating point:

;; c^2 = a^2 + b^2 - cos(C)*2*a*b
;; C is stored in ang

global _start

section .data
    a: dq 4.56   ;length of side a
    b: dq 7.89   ;length of side b
    ang: dq 1.5  ;opposite angle to side c (around 85.94 degrees)

section .bss
    c: resq 1    ;the result ‒ length of side c

section .text
    _start:

    fld qword [a]   ;load a into st0
    fmul st0, st0   ;st0 = a * a = a^2

    fld qword [b]   ;load b into st1
    fmul st1, st1   ;st1 = b * b = b^2

    fadd st1, st0   ;st1 = a^2 + b^2

    fld qword [ang] ;load angle into st0
    fcos            ;st0 = cos(ang)

    fmul qword [a]  ;st0 = cos(ang) * a
    fmul qword [b]  ;st0 = cos(ang) * a * b
    fadd st0, st0   ;st0 = cos(ang) * a * b + cos(ang) * a * b = 2(cos(ang) * a * b)

    fsubp st1, st0  ;st1 = st1 - st0 = (a^2 + b^2) - (2 * a * b * cos(ang))
                    ;and pop st0

    fsqrt           ;take square root of st0 = c

    fst qword [c]   ;store st0 in c ‒ and we're done!

When I execute the following command:

nasm -f elf32 cosineSample.s -o cosineSample.o

I get the following error for the line fmul st1, st1:

error: invalid combination of opcode and operands

What do I need to do to resolve this? Do I need to pass special arguments to nasm? Is the code sample wrong?

Upvotes: 1

Views: 927

Answers (2)

Peter Cordes

Reputation: 365882

I fixed the code on Wikibooks and added some extra comments (Jester's answer is good), so now it assembles and runs correctly (tested with GDB, single-stepping with layout regs / tui reg float). This is the diff between revisions. The revision that introduced the fmul st1,st1 invalid-instruction bug is here, but even before that, the code failed to clear the x87 stack when it was done.


Just for fun, I wanted to write a more efficient version that only loads a and b once.

It also allows more instruction-level parallelism, by doing everything not involving the cos result first: i.e. prepare 2*a*b before multiplying it by cos(ang), so those calculations can run in parallel. Assuming fcos is the critical path, my version has only one fmul and one fsubp of latency from the fcos result to the fsqrt input.

default rel   ; in case we assemble this in 64-bit mode, use RIP-relative addressing

  ... declare stuff, omitted.

    fld    qword [a]   ;load a into st0
    fld    st0         ;   st1 = a  because we'll need it again later.
    fmul   st0, st0    ;st0 = a * a = a^2

    fld    qword [b]   ;load b into st0   (pushing the a^2 result up to st1)
    fmul   st2, st0    ;   st2 = a*b
    fmul   st0, st0    ;st0 = b^2,   st1 = a^2,  st2 = a*b

    faddp              ;st0 = a^2 + b^2   st1 = a*b;        st2 empty
    fxch   st1         ;st0 = a*b         st1 = a^2 + b^2    ;  could avoid this, but only by using cos(ang) earlier, worse for critical path latency
    fadd   st0,st0     ;st0 = 2*a*b       st1 = a^2 + b^2

    fld    qword [ang]
    fcos               ;st0 = cos(ang)       st1 = 2*a*b       st2 = a^2+b^2
    fmulp              ;st0=cos(ang)*2*a*b   st1 = a^2+b^2

    fsubp  st1, st0    ;st0 = (a^2 + b^2) - (2 * a * b * cos(ang))
    fsqrt              ;take square root of st0 = c

    fstp   qword [c]   ;store st0 in c and pop, leaving the x87 stack empty again ‒ and we're done!
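As a cross-check (my addition, not part of the original article), here's the same computation in Python; every version in this answer should store approximately this value in c:

```python
import math

# Same inputs as the .data section: sides a, b and the angle C opposite side c.
a, b, ang = 4.56, 7.89, 1.5

# Law of cosines, grouped the same way as the asm: a^2 + b^2 and 2*a*b are
# computed independently of cos(ang), which only joins in at the end.
sum_sq = a * a + b * b          # a^2 + b^2
two_ab = a * b + a * b          # 2*a*b (the fadd st0,st0 doubling)
c = math.sqrt(sum_sq - two_ab * math.cos(ang))

print(c)  # ≈ 8.8293
```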

Of course, x87 is pretty much obsolete. On modern x86, normally you'd use SSE2 scalar (or packed!) for anything floating-point.

x87 has 2 things going for it on modern x86: 80-bit precision in hardware (vs. 64-bit double), and small code size (machine-code bytes, not number of instructions or source lines). But good instruction caches usually mean code size isn't an important enough factor to make x87 worth it for FP performance: it's generally slower than SSE2 because of the extra instructions needed to deal with the clunky x87 stack.

x87 also has transcendental functions like fcos and fsin, and log/exp, built in as single instructions, which is convenient for beginners and for code size. They're microcoded with many uops, and probably not faster than a scalar library function, but on some CPUs you might be ok with the speed / precision tradeoff they make, and with the absolute speed. At least if you're using x87 in the first place; otherwise you have to bounce the results to/from XMM registers with a store/reload.

x87 range reduction for sin/cos doesn't do any extended-precision tricks to avoid huge relative errors very close to a multiple of Pi; it just uses the internal 80-bit (64-bit-significand) value of Pi. (A library implementation might or might not do better, depending on its speed vs. precision tradeoff.) See Intel Underestimates Error Bounds by 1.3 quintillion.

(And of course x87 in 32-bit code gives you compatibility with Pentium III and other CPUs that didn't have SSE2 for double, only SSE1 for float or no XMM registers at all. x86-64 has SSE2 as baseline so this advantage doesn't exist on x86-64.)

For beginners, the huge downside to x87 is keeping track of the x87 stack registers, and not letting stuff accumulate. You can easily end up with code that works once, but then gives NaN when you put it in a loop because you didn't balance your x87 stack ops.
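To make that concrete (a toy model of my own, not real x87): treat the 8-deep register stack as a counter, where every fld pushes and every fstp pops. A routine that forgets the pop works the first few times, then overflows the stack; a real x87 stack overflow produces an indefinite NaN when exceptions are masked.

```python
# Toy model of the 8-deep x87 register stack (not real x87), showing why an
# unbalanced routine can work once but blow up in a loop.
def run_loop(balanced, iterations):
    depth = 0
    for _ in range(iterations):
        depth += 1                     # fld qword [x] pushes a register
        if depth > 8:
            return 'NaN (stack overflow)'
        if balanced:
            depth -= 1                 # fstp pops, balancing the fld
    return 'ok'

print(run_loop(True, 100))   # fstp at the end: fine forever
print(run_loop(False, 100))  # fst leaves the result behind: fails on iteration 9
```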

extern cos
global cosine_law_sse2_scalar
cosine_law_sse2_scalar:
    movsd   xmm0, [ang]
    call    cos           ; xmm0 = cos(ang).  Avoid using this right away so OoO exec can do the rest of the work in parallel

    movsd   xmm1, [a]
    movsd   xmm2, [b]

    movaps  xmm3, xmm1                ; copying registers should always copy the full reg, not movsd merging into the old value.
    mulsd   xmm3, xmm2   ; xmm3 = a*b

    mulsd   xmm1, xmm1   ; a^2
    mulsd   xmm2, xmm2   ; b^2

    addsd   xmm3, xmm3   ; 2*a*b

    addsd   xmm1, xmm2   ; a^2 + b^2
    mulsd   xmm3, xmm0   ; 2*a*b*cos(ang)
    subsd   xmm1, xmm3   ; (a^2 + b^2) - 2*a*b*cos(ang)

    sqrtsd  xmm0, xmm1   ; sqrt(that), in xmm0 as a return value
    ret
;; This has the work interleaved more than necessary for most CPUs to find the parallelism

This version only has 11 uops after the call cos returns. (https://agner.org/optimize/). It's pretty compact, and pretty simple. No keeping track of the x87 stack. And it has the same nice dependency chains as x87, not using the cos result until we already have 2*a*b.

We can even play around with loading a and b together, as one 128-bit vector. But then unpacking it to do different things with the two halves, or to get b^2 out of the top element as a scalar, is clumsy. If SSE3 haddpd were only 1 uop that would be great (and would let us do a*b + a*b and a^2 + b^2 with one instruction, given the right inputs), but it's 3 uops on all CPUs that have it.

(PS vs. PD only matter for actual math instructions like MULSS/SD. For FP shuffles and register copies, just use whichever FP instruction gets your data where you want it, with a preference for PS/SS because they have shorter encodings in machine code. That's why I used movaps; movapd is always a missed optimization wasting 1 byte, unless you're making instructions longer on purpose for alignment.)
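To put a number on that wasted byte (my illustration, using the documented encodings): MOVAPS xmm,xmm is 0F 28 /r, and MOVAPD is the identical encoding behind a mandatory 66 operand-size prefix.

```python
# MOVAPS xmm3, xmm1 encodes as 0F 28 /r; MOVAPD is the same opcode with a
# mandatory 66 prefix.  ModRM = 11 011 001 = 0xD9 (mod=reg, reg=xmm3, rm=xmm1).
movaps = bytes([0x0F, 0x28, 0xD9])        # movaps xmm3, xmm1 (3 bytes)
movapd = bytes([0x66, 0x0F, 0x28, 0xD9])  # movapd xmm3, xmm1 (4 bytes)

print(len(movapd) - len(movaps))  # the wasted byte: 1
```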

;; I didn't actually end up using SSE3 for movddup or haddpd, it turned out I couldn't save uops that way.
global cosine_law_sse3_less_shuffle
cosine_law_sse3_less_shuffle:
   ;; 10 uops after the call cos, if both extract_high_half operations use pshufd or let movhlps have a false dependency
   ;; or if we had AVX for  vunpckhpd  xmm3, xmm1,xmm1
   ;; and those 10 are a mix of shuffle and MUL/ADD.
    movsd   xmm0, [ang]
    call    cos           ; xmm0 = cos(ang).  Avoid using this right away so OoO exec can do the rest of the work in parallel

    movups  xmm1, [a]     ; {a, b}  (they were in contiguous memory in this order.  low element = a)
    movaps  xmm3, xmm1

   ; xorps   xmm3, xmm3   ; break false dependency by zeroing.  (xorps+movhlps is maybe better than movaps + unpckhpd, at least on SnB but maybe not Bulldozer / Ryzen)
   ; movhlps xmm3, xmm1   ; xmm3 = b
;   pshufd  xmm3, xmm1, 0b01001110   ; xmm3 = {b, a}  ; bypass delay on Nehalem, but fine on most others

    mulsd   xmm3, [b]    ; xmm3 = a*b   ; reloading b is maybe cheaper than shuffling it out of the high half of xmm1
    addsd   xmm3, xmm3   ; 2*b*a
    mulsd   xmm3, xmm0   ; 2*b*a*cos(ang)

    mulpd   xmm1, xmm1   ; {a^2, b^2}

    ;xorps  xmm2, xmm2   ; we don't want to just use xmm0 here; that would couple this dependency chain to the slow cos(ang) critical path sooner.
    movhlps xmm2, xmm1
    addsd   xmm1, xmm2   ; a^2 + b^2

    subsd   xmm1, xmm3   ; (a^2 + b^2) - 2*a*b*cos(ang)

    sqrtsd  xmm0, xmm1   ; sqrt(that), in xmm0 as a return value
    ret

And we can do even better with AVX, saving a MOVAPS register copy, because the 3-operand non-destructive VEX versions of instructions let us put the result in a new register without destroying either of the inputs. This is really great for FP shuffles, because SSE* doesn't have any copy-and-shuffle for FP operands, only pshufd, which can cause extra bypass latency on some CPUs. So it saves the MOVAPS, and the (commented-out) XORPS that breaks a dependency on whatever produced the old value of XMM2 for MOVHLPS. (MOVHLPS replaces the low 64 bits of the destination with the high 64 of the source, so it has an input dependency on both registers.)

global cosine_law_avx
cosine_law_avx:
   ;; 9 uops after the call cos.  Reloading [b] is good here instead of shuffling it, saving total uops / instructions
    vmovsd   xmm0, [ang]
    call     cos           ; xmm0 = cos(ang).  Avoid using this right away so OoO exec can do the rest of the work in parallel

    vmovups  xmm1, [a]     ; {a, b}  (they were in contiguous memory in this order.  low element = a)

    vmulsd   xmm3, xmm1, [b]  ; xmm3 = a*b

    vaddsd   xmm3, xmm3   ; 2*b*a.   (really vaddsd xmm3,xmm3,xmm3  but NASM lets us shorten when dst=src1)
    vmulsd   xmm3, xmm0   ; 2*b*a*cos(ang)

    vmulpd   xmm1, xmm1   ; {a^2, b^2}

    vunpckhpd xmm2, xmm1,xmm1  ; xmm2 = { b^2, b^2 }
    vaddsd   xmm1, xmm2   ; a^2 + b^2

    vsubsd   xmm1, xmm3   ; (a^2 + b^2) - 2*a*b*cos(ang)

    vsqrtsd  xmm0, xmm1,xmm1   ; sqrt(that), in xmm0 as a return value.  (Avoiding an output dependency on xmm0, even though it was an ancestor in the dep chain.  Maybe lets the CPU free that physical reg sooner)
    ret

I only tested the first x87 version, so it's possible I missed a step in one of the others.

Upvotes: 1

Jester

Reputation: 58822

That code is broken, unfortunately. fmul cannot operate on st1, st1, but even if it could, it wouldn't do what the author wanted. As per the comment, he wanted to calculate b*b, but b is in st0 at that point. The comment load b into st1 is wrong: fld always loads into st0 (the top of the stack). You need to change the fmul st1, st1 to fmul st0, st0. Furthermore, to get the correct result, the following fadd st1, st0 has to be reversed as well. The code also leaves the FPU stack dirty.
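A quick way to see this (a toy model of my own, not real x87): treat the register stack as a Python list with index 0 as st0. fld inserts at the front, so b lands in st0 and pushes the a^2 result up to st1:

```python
# Tiny model of the x87 stack (index 0 = st0), showing why the original
# comment "load b into st1" is wrong: fld always pushes onto st0.
stack = []

def fld(x):
    stack.insert(0, x)    # push: the new value becomes st0

fld('a')                  # st0 = a
stack[0] = 'a^2'          # fmul st0, st0
fld('b')                  # st0 = b, st1 = a^2 -- b is NOT in st1

print(stack)              # ['b', 'a^2']
```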

Also note that the program has no ending, so it will segfault when execution falls off the end unless you add an explicit exit system call.

Here is the fixed code, converted to gnu assembler syntax:

.intel_syntax noprefix

.global _start

.data
    a: .double 4.56   # length of side a
    b: .double 7.89   # length of side b
    ang: .double 1.5  # opposite angle to side c (around 85.94 degrees)

.lcomm c, 8

.text
    _start:

    fld qword ptr [a]   # load a into st0
    fmul st             # st0 = a * a = a^2

    fld qword ptr [b]   # load b into st0
    fmul st             # st0 = b * b = b^2

    faddp               # st0 = a^2 + b^2

    fld qword ptr [ang] # load angle into st0
    fcos                # st0 = cos(ang)

    fmul qword ptr [a]  # st0 = cos(ang) * a
    fmul qword ptr [b]  # st0 = cos(ang) * a * b
    fadd st             # st0 = cos(ang) * a * b + cos(ang) * a * b = 2(cos(ang) * a * b)

    fsubp               # st1 = st1 - st0 = (a^2 + b^2) - (2 * a * b * cos(ang))
                        # and pop st0

    fsqrt               # take square root of st0 = c

    fstp qword ptr [c]  # store st0 in c - and we're done!

    # end program
    mov eax, 1
    xor ebx, ebx
    int 0x80

Upvotes: 3
