tsybaya

Reputation: 137

How to get lower and higher 32 bits of a 64-bit integer for gcc inline asm? (ARMV5 platform)

I have a project on the armv5te platform, and I have to rewrite some functions in assembly to use the enhanced DSP instructions. I use the int64_t type a lot for accumulators, but I have no idea how to pass it to the ARM SMULL instruction (http://www.keil.com/support/man/docs/armasm/armasm_dom1361289902800.htm).

How can I pass the lower or upper 32 bits of a 64-bit variable to a 32-bit register? (I know that I can use intermediate int32_t variables, e.g. as in the sketch below, but it does not look good.)
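Something like this is what I mean by intermediate variables (just a rough sketch; split64 is a made-up helper name):

#include <stdint.h>

/* Split a 64-bit accumulator into two 32-bit halves via intermediate variables. */
static inline void split64(int64_t acc, uint32_t *lo, uint32_t *hi)
{
   *lo = (uint32_t)acc;          /* low 32 bits */
   *hi = (uint32_t)(acc >> 32);  /* high 32 bits (arithmetic shift of the signed value) */
}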

I know that the compiler would do it for me, but I wrote this small function just as an example.

int64_t testFunc(int64_t acc, int32_t x, int32_t y)
{
   int64_t tmp_acc;

   asm("SMULL %0, %1, %2, %3"
      : "=r"(tmp_acc), "=r"(tmp_acc) // no idea how to pass tmp_acc;
      : "r"(x), "r"(y)
      );

   return tmp_acc + acc;
}

Upvotes: 2

Views: 2410

Answers (1)

Peter Cordes

Reputation: 365781

You don't need inline asm for this, and shouldn't use it. The compiler can do even better than smull, using smlal to multiply-accumulate with a single instruction:

int64_t accum(int64_t acc, int32_t x, int32_t y) {
    return acc + x * (int64_t)y;
}

which compiles (with gcc8.2 -O3 -mcpu=arm10e on the Godbolt compiler explorer) to this asm: (ARM10E is an ARMv5 microarchitecture I picked from Wikipedia's list)

accum:
    smlal   r0, r1, r3, r2        @, y, x
    bx      lr  @

As a bonus, this pure C also compiles efficiently for AArch64.
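(I haven't double-checked the exact output, but for AArch64 I'd expect gcc -O3 to boil this down to a single smaddl, roughly like this; the register allocation just follows the calling convention for the function above:)

accum:
    smaddl  x0, w1, w2, x0    // x0 = acc + sign_extend(x) * sign_extend(y)
    ret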

https://gcc.gnu.org/wiki/DontUseInlineAsm


If you insist on shooting yourself in the foot and using inline asm:

Or, in the general case with other instructions, there might be situations where you'd actually want this.

First, beware that the smull output registers aren't allowed to overlap the first input register, so you have to tell the compiler about this. An early-clobber constraint on the output operand(s) does the trick: it tells the compiler it can't keep inputs in those registers. I don't see a clean way to tell the compiler that the 2nd input can be in the same register as an output.

This restriction is lifted in ARMv6 and later (the Keil documentation says "Rn must be different from RdLo and RdHi in architectures before ARMv6"), but for ARMv5 compatibility you need to make sure the compiler doesn't violate this when filling in your inline-asm template.

Optimizing compilers can optimize away a shift/OR that combines 32-bit C variables into a 64-bit C variable, when targeting a 32-bit platform. They already store 64-bit variables as a pair of registers, and in normal cases can figure out there's no actual work to be done in the asm.

So you can specify a 64-bit input or output as a pair of 32-bit variables.

#include <stdint.h>

int64_t testFunc(int64_t acc, int32_t x, int32_t y)
{
   uint32_t prod_lo, prod_hi;

   asm("SMULL %0, %1, %2, %3"
      : "=&r" (prod_lo), "=&r"(prod_hi)  // early clobber for pre-ARMv6
      : "r"(x), "r"(y)
      );

    int64_t prod = ((int64_t)prod_hi) << 32;
    prod |= prod_lo;        // + here won't optimize away, but | does, with gcc
    return acc + prod;
}

Unfortunately the early-clobber means we need 6 registers in total, but the ARM calling convention only has 6 call-clobbered registers (r0..r3, lr, and ip (aka r12)), and one of them is LR, which holds the return address, so we can't lose its value. Probably not a big deal when inlined into a regular function that already saves/restores several registers.

Again from Godbolt:

@ gcc -O3 output with early-clobber, valid even before ARMv6
testFunc:
    str     lr, [sp, #-4]!    @,         Save return address (link register)
    SMULL ip, lr, r2, r3    @ prod_lo, prod_hi, x, y
    adds    r0, ip, r0      @, prod, acc
    adc     r1, lr, r1        @, prod, acc
    ldr     pc, [sp], #4      @          return by popping the return address into PC


@ gcc -O3 output without early-clobber (&) on output constraints:
@ valid only for ARMv6 and later
testFunc:
    SMULL r3, r2, r2, r3    @ prod_lo, prod_hi, x, y
    adds    r0, r3, r0      @, prod, acc
    adc     r1, r2, r1        @, prod, acc
    bx      lr  @

Or you can give the asm a 64-bit output operand with a plain "=r"(prod) constraint and use operand modifiers to select which half of %0 you get. Unfortunately, gcc and clang emit less efficient asm this way for some reason, saving more registers (and maintaining 8-byte stack alignment): 2 saved instead of 1 for gcc, 4 instead of 2 for clang.

// using an int64_t directly with inline asm, via the %Q0 and %R0 operand modifiers
// Q is the low half, R is the high half.
int64_t testFunc2(int64_t acc, int32_t x, int32_t y)
{
   int64_t prod;    // gcc and clang seem to want more free registers this way

   asm("SMULL %Q0, %R0, %1, %2"
      : "=&r" (prod)         // early clobber for pre-ARMv6
      : "r"(x), "r"(y)
      );

    return acc + prod;
}

Again compiled with gcc -O3 -mcpu=arm10e. (clang saves/restores 4 registers.)

@ gcc -O3 with the early-clobber so it's safe on ARMv5
testFunc2:
    push    {r4, r5}        @
    SMULL r4, r5, r2, r3    @ prod, x, y
    adds    r0, r4, r0      @, prod, acc
    adc     r1, r5, r1        @, prod, acc
    pop     {r4, r5}  @
    bx      lr  @

So for some reason it seems to be more efficient to manually handle the halves of a 64-bit integer with current gcc and clang. This is obviously a missed optimization bug.

Upvotes: 4
