Cygwin: Compile cpp file with asm tag

Question

I'm new to assembly and am trying to write C++ code with an asm block. I'm using Cygwin to compile. Here is my code:

#include <iostream>
using namespace std;

int main()  
{  
    float flp1_num, flp2_num, flp_rslt1;

    cin >>flp1_num >>flp2_num;

    __asm
    {
        FLD flp1_num
        FLDPI
        FADD flp2_num
        FST flp_rslt1
    }

    cout << flp_rslt1;
}

The syntax I used is from here.

I'm compiling with g++ arq.cpp -o arq.exe, which gives me an error saying:

arq.cpp: In function ‘int main()’:
arq.cpp:13:5: error: expected ‘(’ before ‘{’ token
     {
     ^
arq.cpp:14:9: error: ‘FLD’ was not declared in this scope
         FLD flp1_num
         ^

Then I tried changing __asm {} into __asm() and it gave me a different error:

arq.cpp: In function ‘int main()’:
arq.cpp:14:9: error: expected string-literal before ‘FLD’
         FLD flp1_num

I've searched around and found a few alternatives that might work, but they didn't work for me. For example, both __asm__("fld flp1_num"); and asm("fld flp1_num"); give me an error saying /tmp/cccDDfUP.o:arq.cpp:(.text+0x32): undefined reference to flp1_num.

How do I fix this error?

Cody Gray · Accepted Answer

As others have said, you are looking at Microsoft's documentation for their compiler, which has a very different form of inline assembly than the one used by GCC. In fact, it is a substantially less powerful form, in many ways, although it does have the saving grace of being much easier to learn to use.

You will need to consult the documentation for the GNU inline assembly syntax, available here. For a gentler introduction, there is a good tutorial here, and I particularly like David Wohlferd's answer here. Although it answers an unrelated question, it gives a very good introduction to the basics of inline assembly if you just follow along with his explanation for the sake of it.

Anyway, on to your specific problem. A couple of immediate issues:

The code very likely does not do what you think it does. What your code actually does is add pi to flp2_num, and then put that result into flp_rslt1. It doesn't do anything at all with flp1_num.

If I had to guess, I would imagine that you want to add flp1_num, pi, and flp2_num all together, and then return the result in flp_rslt1. (But maybe not; it isn't really clear, since you don't have any comments stating your intent, nor a descriptive function name.)

Your code is also broken because it does not properly clean up the floating-point stack. You had two "load" instructions, but no pop instructions! Everything you push/load onto the floating-point stack must be popped/unloaded, or you imbalance the floating-point stack, which causes major problems.

Therefore, in the MSVC syntax, your code should have looked something like the following (wrapped up into a function for convenience and clarity):

float SumPlusPi(float flp1_num, float flp2_num)
{
    float flp_rslt1;
    __asm
    {
       fldpi                       ; load the constant PI onto the top of the FP stack
       fadd  DWORD PTR [flp2_num]  ; add flp2_num to PI, and leave the result on the top of the stack
       fadd  DWORD PTR [flp1_num]  ; add flp1_num to the top of the stack, again leaving the result there
       fstp  DWORD PTR [flp_rslt1] ; pop the top of the stack into flp_rslt1
    }
    return flp_rslt1;
}

I only pushed one time (fldpi), so I only popped one time (fstp). For the additions, I used the form of fadd that takes a memory operand; the value is read from memory and converted on the fly rather than being pushed onto the stack, so it needs no matching pop. There are, however, many different ways you could have written this. The important thing is to balance the number of pushes with the number of pops. There are instructions that explicitly pop (fstp), and there are other instructions that perform an operation and then pop (e.g., faddp). Different combinations of instructions, in certain orders, are very likely more optimal than others, but my code above does work.

And here is the equivalent code translated into GAS syntax:

float SumPlusPi(float flp1_num, float flp2_num)
{
    float flp_rslt1;
    __asm__("fldpi        \n\t"
            "fadds %[two] \n\t"
            "fadds %[one]"
           : [result] "=t" (flp_rslt1)   // tell compiler result is left at the top of the floating-point stack,
                                         //  making an explicit pop unnecessary
           : [one]    "m" (flp1_num),    // input operand from memory (inefficient)
             [two]    "m" (flp2_num));   // input operand from memory (inefficient)
    return flp_rslt1;
}

Although this works, it is also sub-optimal because it does not take advantage of the advanced features of the GAS inline assembly syntax, particularly the ability to consume values already loaded onto the floating-point stack as inputs.

Most importantly, though, don't miss the reasons why you should not use inline assembly (also by David Wohlferd)! This is a truly pointless usage of inline assembly. The compiler will generate better code, and it will require significantly less work on your part. Therefore, prefer to write the above function like this:

#include <cmath>    // for M_PI constant

float SumPlusPi(float flp1_num, float flp2_num)
{
    return (flp1_num + flp2_num + static_cast<float>(M_PI));
}

Notice that if you actually want to implement different logic than I had been assuming, it is trivial to alter this code to do what you want.

In case you don't believe me that this produces code that is equally good as your inline assembly—if not better—here is the exact object code generated by GCC 6.2 for the above function (Clang emits the same code):

fld     DWORD PTR [flp2_num]  ; load flp2_num onto top of FPU stack
fadd    DWORD PTR [flp1_num]  ; add flp1_num to value at top of FPU stack
fadd    DWORD PTR [M_PI]      ; add constant M_PI to value at top of FPU stack
ret                           ; return, with result at top of FPU stack

There is no speed win in using fldpi versus loading the value from a constant like GCC does. If anything, forcing the use of this instruction is actually a pessimization, because it means your code cannot ever take advantage of the SSE/SSE2 instructions that allow manipulating floating-point values far more efficiently than the old x87 FPU. Enabling SSE/SSE2 for the above C code is as simple as throwing a compiler switch (or specifying a target architecture that supports it, which will implicitly enable it). That will give you the following:

sub       esp, 4                      ; reserve space on the stack    
movss     xmm0, DWORD PTR [M_PI]      ; load M_PI constant
addss     xmm0, DWORD PTR [flp2_num]  ; add flp2_num
addss     xmm0, DWORD PTR [flp1_num]  ; add flp1_num
movss     DWORD PTR [esp], xmm0       ; store result in temporary space on stack
fld       DWORD PTR [esp]             ; load result from stack to top of FPU stack
add       esp, 4                      ; clean up stack space
ret                                   ; return, with result at top of FPU stack
