Reputation: 25
I'm trying to optimize a mathematical function by writing inline assembly with GCC for an ARM Cortex-A7. My code is this:
__inline int __attribute__((pcs("aapcs"))) optAbsVal(int x)
{
asm("CMP R0, #0\n"
"IT LT\n"
"RSBLT R0, R0, #0");
return(x);
}
I did not specify any input/output operands or clobbers for the inline asm block because, according to the calling convention, x should be in R0, as should the return value. The problem is that this function returns x unmodified, which makes me think either x is not in R0 or the compiler is changing the function in some way. I resolved this by adding the operands "=r"(x) : "0"(x), but I'm still not satisfied with this code, as it seems to perform unnecessary operations. I'm using pcs("aapcs") to avoid load/store operations and get better performance, but the result is getting worse instead.
Upvotes: 1
Views: 381
Reputation: 364200
Inline asm is completely pointless here. GCC already knows how to optimize an absolute-value, and hiding that process from the compiler inside inline asm will make your code optimize worse, not better. https://gcc.gnu.org/wiki/DontUseInlineAsm
Writing absolute value in pure C is always at least as good (unless the compiler decides to make branchy code after inlining into something, and profiling shows that branching was the wrong choice.)
int absval(int x) {
return x<0 ? -x : x; // ternary often compiles branchlessly
}
The advantages over inline asm include: the compiler knows the result is non-negative, and can optimize accordingly. For example, it can divide by a power of 2 with a simple right shift, instead of needing to account for the different rounding of shifts vs. C signed division:
void foo_asm (int *arr, int len) {
for (int i=0 ; i<1024 ; i++){
arr[i] = optAbsVal(arr[i]) / 4; // Using Ross's correct implementation
}
}
inner loop (from gcc6.3 -O3 -mcpu=cortex-a7 -mthumb on the Godbolt compiler explorer):
.L4:
ldr r3, [r2, #4]
CMP r3, #0 @@@@ Inline asm version
IT LT
RSBLT r3, r3, #0
adds r1, r3, #3
bics r3, r3, r3, asr #32
it cs
movcs r3, r1 @ x = x<0 ? x+3 : x (I think, I didn't look up BICS)
asrs r3, r3, #2 @ x >>= 2
str r3, [r2, #4]!
cmp r2, r0
bne .L4
vs.
void foo_pure (int *arr, int len) {
for (int i=0 ; i<1024 ; i++){
arr[i] = absval(arr[i]) / 4; // Using my pure C
}
}
.L8: @@@@@@@@ Pure C version
ldr r3, [r2, #4]
cmp r3, #0 @ gcc emitted exactly your 3-insn sequence on its own
it lt
rsblt r3, r3, #0
asrs r3, r3, #2 @ non-negative division by 4 is a trivial >> 2
str r3, [r2, #4]!
cmp r1, r2
bne .L8
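The rounding difference behind those extra instructions in the inline-asm loop can be reproduced in plain C. This is a minimal sketch, not taken from the answer: the names div4_trunc, div4_shift and div4_fixup are made up for illustration, and it assumes a 32-bit int and that >> on a negative int is an arithmetic shift (implementation-defined in C, but what GCC does).

```c
/* Hypothetical helpers for illustration; assumes 32-bit int and
   arithmetic right shift of negative values (GCC's behaviour). */

/* What C requires for x / 4: truncation toward zero. */
int div4_trunc(int x) { return x / 4; }

/* A bare arithmetic shift rounds toward negative infinity instead. */
int div4_shift(int x) { return x >> 2; }

/* Shift plus the fix-up needed when x may be negative:
   add 3 to negative inputs first, so the shift rounds toward zero. */
int div4_fixup(int x) { return (x + ((x >> 31) & 3)) >> 2; }
```

For non-negative inputs all three agree, which is why the pure C loop gets away with a single asrs once the compiler knows the value came from an absolute value.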
Knowing that a signed variable is non-negative is often valuable for a compiler. (And signed overflow is undefined behaviour, so it's allowed to ignore the fact that 0 - 0x80000000 = 0x80000000, i.e. that -INT_MIN still has its sign bit set, because -INT_MIN is UB. The most negative number is a special case for 2's complement.)
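If the INT_MIN case matters to the caller, one way to sidestep the undefined behaviour is to return the magnitude as an unsigned value. This is only a sketch under the assumption of a 32-bit int; abs_unsigned is a hypothetical name, not something from the question or the answers.

```c
#include <limits.h>

/* Hypothetical helper: absolute value returned as unsigned, so the
   magnitude of INT_MIN is representable. Assumes 32-bit int. */
unsigned abs_unsigned(int x)
{
    /* Convert first, then negate: unsigned arithmetic wraps around
       instead of overflowing, so there is no UB for INT_MIN. */
    return x < 0 ? 0u - (unsigned)x : (unsigned)x;
}
```

The trade-off is that the compiler can no longer assume the result fits in a non-negative int, so this gives up the optimization described above in exchange for a defined result on every input.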
gcc could be doing even better by looking at flags already set by previous instructions instead of doing a cmp. (This could also allow better instruction scheduling for in-order cores).
But for absval(100 + arr[i]) I'm seeing
adds r3, r3, #100
cmp r3, #0
it lt
rsblt r3, r3, #0
instead of using the sign flag alone for the MInus condition.
@ hand-written, IDK why gcc doesn't do this, probably missed optimization:
adds r3, r3, #100 @ set flags
it MI @ use the MInus condition instead of LessThan
rsbmi r3, r3, #0
Inline asm also fails to take advantage of ARM's 3-operand instructions. rsb could produce the result in a different register than the input (in ARM mode at least, and in unified syntax IT doesn't require thumb mode). But you can't just use a separate output operand for x if you want your asm to still assemble in Thumb mode, where rsb r1, r0, #0 wouldn't assemble.
And also, inline asm blocks constant-propagation. optAbsVal(-1) compiles to 4 instructions to flip it at run-time. absval(-1) compiles to a compile-time constant of 1.
On targets with NEON, the compiler also can't auto-vectorize code that goes through inline asm. It may also make the compiler not unroll a loop when it otherwise would have.
Upvotes: 1
Reputation: 39581
Since x isn't the return value, it doesn't need to be in R0. The return value is the result of evaluating the expression given in the return statement. So with return x the return value isn't x, the return value is the value of x. This is an important distinction because it means x doesn't need to live in R0, only that the value in x needs to be copied into R0 before the function returns.
Since the last statement executed in your function is return (x); the last thing your function does is copy x to R0, which clobbers the value you stored in R0 in your inline assembly statement.
This is why you must always fully describe the effect your inline assembly statements have on the machine state. The compiler has no idea you want the value in R0 preserved. It has no idea you expect the value passed in the x parameter to be in R0 on entry to the asm statement. That might be true because of the calling convention, but the rules of the calling convention only apply at entry to and exit from a function, not in the middle of a function where your asm statement is. If your function is inlined into another function then the calling convention doesn't apply at all, since there's no actual function call.
So what you want is something like this:
__inline int optAbsVal(int x)
{
asm("CMP %0, #0\n"
"IT LT\n"
"RSBLT %0, %0, #0"
: "+r" (x) : : "cc");
return(x);
}
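To experiment with the version above on a machine that isn't 32-bit ARM, it can be wrapped with a plain C fallback. This is a rough sketch rather than a recommended implementation: optAbsValPortable is a hypothetical name, the __arm__ / __GNUC__ checks are an assumption about how the target is detected, and on ARM it relies on the assembler accepting IT exactly as in the code above.

```c
/* Sketch: same operand constraints as above, with a pure C fallback
   so the file still builds when not targeting 32-bit ARM. */
static inline int optAbsValPortable(int x)   /* hypothetical name */
{
#if defined(__arm__) && defined(__GNUC__)
    __asm__("CMP %0, #0\n\t"
            "IT LT\n\t"
            "RSBLT %0, %0, #0"
            : "+r" (x)      /* x is read and written; the compiler picks the register */
            :               /* no separate inputs */
            : "cc");        /* CMP modifies the condition flags */
#else
    x = x < 0 ? -x : x;     /* plain C fallback for other targets */
#endif
    return x;
}
```

The "+r" constraint is what tells the compiler that the asm both consumes and produces x, so no particular register such as R0 has to be assumed.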
Upvotes: 3