CDevel

Reputation: 25

GCC inline asm with aapcs

I'm trying to optimize a mathematical function by writing inline assembly with GCC for an ARM Cortex-A7. My code is this:

__inline int __attribute__((pcs("aapcs"))) optAbsVal(int x)
{
  asm("CMP R0, #0\n"
      "IT LT\n"
      "RSBLT R0, R0, #0");
  return(x);
}

I did not specify any input/output operands or clobbers in the inline asm block because, according to the calling convention, x should be in R0, and so should the return value. The problem is that this function returns x unmodified, which makes me think that either x is not in R0 or the compiler is changing the function in some way. I worked around this by adding the operands "=r"(x) : "0"(x), but I'm still not satisfied with the code, as it seems to be doing unnecessary operations. The reason I'm using pcs("aapcs") is to avoid load/store operations and get better performance, but it's making things worse instead.

Upvotes: 1

Views: 381

Answers (2)

Peter Cordes

Reputation: 364200

Inline asm is completely pointless here. GCC already knows how to optimize an absolute-value, and hiding that process from the compiler inside inline asm will make your code optimize worse, not better. https://gcc.gnu.org/wiki/DontUseInlineAsm

Writing absolute value in pure C is always at least as good (unless the compiler decides to make branchy code after inlining into something, and profiling shows that branching was the wrong choice.)

int absval(int x) {
    return x<0 ? -x : x;  // ternary often compiles branchlessly
}

The advantages over inline asm include: the compiler knows the result is non-negative, and can optimize accordingly. For example, it can divide by a power of 2 with a simple right shift, instead of needing to account for the different rounding of shifts vs. C signed division:

void foo_asm (int *arr, int len) {
    for (int i=0 ; i<1024 ; i++){
      arr[i] = optAbsVal(arr[i]) / 4;  // Using Ross's correct implementation
    }
}

inner loop (from gcc6.3 -O3 -mcpu=cortex-a7 -mthumb on the Godbolt compiler explorer):

.L4:
    ldr     r3, [r2, #4]
    CMP r3, #0             @@@@ Inline asm version
IT LT
RSBLT r3, r3, #0
    adds    r1, r3, #3
    bics    r3, r3, r3, asr #32
    it      cs
    movcs   r3, r1           @ x = x<0 ? x+3 : x  (I think, I didn't look up BICS)
    asrs    r3, r3, #2       @ x >>= 2
    str     r3, [r2, #4]!
    cmp     r2, r0
    bne     .L4

vs.

void foo_pure (int *arr, int len) {
    for (int i=0 ; i<1024 ; i++){
      arr[i] = absval(arr[i]) / 4;  // Using my pure C
    }
}

.L8:               @@@@@@@@ Pure C version
    ldr     r3, [r2, #4]
    cmp     r3, #0           @ gcc emitted exactly your 3-insn sequence on its own
    it      lt
    rsblt   r3, r3, #0
    asrs    r3, r3, #2       @ non-negative division by 4 is a trivial >> 2
    str     r3, [r2, #4]!
    cmp     r1, r2
    bne     .L8
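
To see why the extra adds/bics/movcs sequence is needed when the compiler cannot prove the operand is non-negative, here is a minimal sketch comparing C's truncating division with an arithmetic right shift. The helper names div4, shift4 and div4_via_shift are made up for illustration, and right-shifting a negative int is implementation-defined in C (GCC documents it as an arithmetic shift, which this sketch assumes).

```c
/* Hypothetical helpers, not from the answer: compare x / 4 with x >> 2. */
static int div4(int x)   { return x / 4; }   /* C division truncates toward zero */
static int shift4(int x) { return x >> 2; }  /* assumed arithmetic shift: rounds toward -infinity */

/* Roughly what must be computed for x / 4 when x may be negative:
   bias negative inputs by 3 (divisor minus 1), then shift. */
static int div4_via_shift(int x) { return (x < 0 ? x + 3 : x) >> 2; }
```

For non-negative inputs div4 and shift4 agree, so the bias step can be dropped. For x = -7 they differ (truncation vs. flooring), which is the case the biased version corrects.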

Knowing that a signed variable is non-negative is often valuable for a compiler. (And signed overflow is undefined behaviour, so it's allowed to ignore the fact that 0 - 0x80000000 = 0x80000000, i.e. that -INT_MIN still has its sign bit set, because -INT_MIN is UB. The most negative number is a special case for 2's complement.)
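
If the INT_MIN case matters in your code, one common way to sidestep the undefined behaviour is to return the magnitude as an unsigned value. This is a sketch rather than part of the answer above; the name uabs is hypothetical.

```c
#include <limits.h>

/* Hypothetical helper: magnitude of x as an unsigned value.
   The negation is done in unsigned arithmetic, which wraps instead of
   being undefined, so uabs(INT_MIN) is well defined. */
static unsigned uabs(int x)
{
    return x < 0 ? 0u - (unsigned)x : (unsigned)x;
}
```

The trade-off is that the result is no longer a signed int the compiler knows to be non-negative, so this only makes sense where the caller can work with an unsigned magnitude.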


gcc could be doing even better by looking at flags already set by previous instructions instead of doing a cmp. (This could also allow better instruction scheduling for in-order cores).

But for absval(100 + arr[i]) I'm seeing

    adds    r3, r3, #100
    cmp     r3, #0
    it      lt
    rsblt   r3, r3, #0

instead of using the sign flag alone for the MInus condition.

    @ hand-written, IDK why gcc doesn't do this, probably missed optimization:
    adds    r3, r3, #100    @ set flags
    it      MI              @ use the MInus condition instead of LessThan
    rsbmi   r3, r3, #0

Inline asm also fails to take advantage of ARM's 3-operand instructions. rsb could produce the result in a different register than the input (in ARM mode at least, and in unified syntax IT doesn't require thumb mode). But you can't just use a separate output operand for x if you want your asm to still assemble in Thumb mode, where rsb r1, r0, #0 wouldn't assemble.

And also, inline asm blocks constant-propagation. optAbsVal(-1) compiles to 4 instructions to flip it at run-time. absval(-1) compiles to a compile-time constant of 1.
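
A small sketch of the kind of call site to compare on a compiler explorer. The wrapper name abs_of_minus_one is hypothetical, and whether the fold actually happens depends on the compiler and optimization level.

```c
/* Pure C version from above, marked static inline so it is visible at the call site. */
static inline int absval(int x)
{
    return x < 0 ? -x : x;
}

/* Hypothetical call site: with optimization enabled, GCC can typically fold
   this whole function down to returning the constant 1, because nothing
   about absval is hidden from the optimizer. */
int abs_of_minus_one(void)
{
    return absval(-1);
}
```

Swapping absval for the inline-asm version at the same call site is an easy way to check the difference in the generated code for your own compiler version.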

On targets with NEON, inline-asm also can't auto-vectorize. It may also make the compiler not unroll a loop when it otherwise would have.

Upvotes: 1

Ross Ridge

Reputation: 39581

Since x isn't the return value, it doesn't need to be in R0. The return value is the result of evaluating the expression given in the return statement. So with return x, the return value isn't x; the return value is the value of x. This is an important distinction because it means x doesn't need to live in R0, only that the value of x needs to be copied into R0 before the function returns.

Since the last statement executed in your function is return (x);, the last thing your function does is copy x to R0, which clobbers the value you stored in R0 in your inline assembly statement.

This is why you must always fully describe the effect of your inline assembly statements on the machine state. The compiler has no idea you want the value in R0 preserved. It has no idea you expect the value passed in the x parameter to be in R0 on entry to the asm statement. That might be true because of the calling convention, but the rules of the calling convention only apply on entry to and exit from a function, not in the middle of a function where your asm statement is. If your function is inlined into another function, then the calling convention doesn't apply at all, since there's no actual function call.

So what you want is something like this:

__inline int optAbsVal(int x)
{
  asm("CMP %0, #0\n"
      "IT LT\n"
      "RSBLT %0, %0, #0"
      : "+r" (x) : : "cc");
  return(x);
}
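
If the same source also has to build for other targets, one option is to guard the asm and fall back to plain C elsewhere. This is only a sketch: the name optAbsValPortable is hypothetical, and the ARM branch assumes a GNU-compatible compiler that accepts the same IT/RSBLT sequence as the code above.

```c
/* Sketch: use the asm only when compiling for 32-bit ARM with a
   GNU-compatible compiler; otherwise fall back to plain C. */
static inline int optAbsValPortable(int x)   /* hypothetical name */
{
#if defined(__GNUC__) && defined(__arm__)
    __asm__("cmp   %0, #0\n\t"
            "it    lt\n\t"
            "rsblt %0, %0, #0"
            : "+r"(x)   /* x is both read and written, in whatever register GCC picks */
            :           /* no separate inputs */
            : "cc");    /* cmp modifies the condition flags */
#else
    x = x < 0 ? -x : x; /* plain C fallback for every other target */
#endif
    return x;
}
```

The "+r"(x) operand is the read-write shorthand for the "=r"(x) : "0"(x) pair mentioned in the question, so both forms describe the same thing to the compiler.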

Upvotes: 3
