Reputation: 84642
When evaluating two alternatives to solve a problem, comparing clock-cycles and latencies, if both evaluate roughly the same, is there a better way to decide which to use?
Example - Converting to Hexstring
An example I was looking at involves converting an integer value to hex for output. There are two basic approaches:
"0123456789abcdef"
using ldr
(roughly 3 clock-cycles); or10
and either add '0'
or 'W'
which involves a cmp
and then two conditional adds (e.g. addlo
and addhs
) which at roughly 1 clock-cycle each is again about 3 clock-cycles.(using the rough latencies from the link in the answer from Instruction execution latencies for A53 -- there apparently isn't a good a53 specific latency reference)
Example - Hex Convert Loop Alternatives
The following is code for a Cortex-A53 (Raspberry Pi 3B):
hexdigits: .asciz "0123456789abcdef"
...
ldr r8, hexdigitadr /* load address for hexdigits */
...
hexcvtloop:
cmp r6, 0 /* separation of digits done? */
beq hexcopy /* copy tmp string to address */
udiv r0, r6, r7 /* divide by base, quotient in r0 */
mls r2, r0, r7, r6 /* mod (remainder) in r2 */
mov r6, r0 /* quotient to value */
/* alternative 1 - ASCII hexdigit lookup */
ldrb r2, [r8, r2] /* hexdigit lookup */
/* alternative 2 - add to obtain ASCII hexdigit */
cmp r2, 10 /* compare digit to 10 */
addlo r2, r2, '0' /* convert to digits '0'-'9' */
addhs r2, r2, 'W' /* convert to 'a'-'f' */
strb r2, [r5], 1 /* store in tmp string */
b hexcvtloop
I understand the clock-cycle figures in that reference don't account for other factors: interrupts, memory speed, cache misses, etc.
If my rough estimates of about 3 clock-cycles each -- for either the hex-digit lookup with ldr, or the cmp, addlo, addhs sequence that adds to the remainder -- are fair, is there another consideration that would decide between the two approaches, or is it basically personal preference at that point?
(I'm not overly concerned with getting a Cortex-A53-specific answer; I'm more interested in whether there are other general ARM metrics I should look to next -- or if it's just "up to you" at this point.)
Upvotes: 0
Views: 162
Reputation: 365727
Hopefully this general-case loop using udiv doesn't run very often, with special-case blocks handling base 10 and 16 (footnote 1).
It's important to consider whether a block is hot or not. If it's not, don't spend much time looking at it, or optimize for size (usually overall size, including data) in ways that don't lose badly on speed. That means don't use the LUT. (Except that if you have a specialized int->hex loop, it could share the same LUT, e.g. loaded as one 16-byte SIMD vector.)
On an in-order CPU like A53, per-instruction latency isn't really separate from throughput, but the amount of software-pipelining you can use to hide it can matter, since the HW won't do it for you. That's especially true for load latency: loading and then storing the result in the very next instruction is something to avoid, since it'll probably stall for about the load-use latency.
You always want to find some work you can do between a load and the first use, if you care about your code ever running on in-order cores. That might mean rearranging the loop. For example, you can probably put the store after preparing the next digit, as in the sketch below. In this slow general case, that's a udiv, which should be plenty slow to hide load-use latency. (High-performance in-order ARM cores still do have memory-level parallelism and scoreboard loads or something.)
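As a rough sketch of that rescheduling (mine, not from the question; it assumes the question's register assignments of r6 = value, r7 = base, r8 = LUT address, r5 = output pointer, plus r3 as a free scratch register), the LUT version could put the next digit's udiv/mls between the ldrb and the strb:
hexcvt_sketch:                       @ value in r6, base in r7 (e.g. 16)
    udiv    r0, r6, r7               @ quotient for the first digit
    mls     r2, r0, r7, r6           @ remainder = first digit's value
    mov     r6, r0
1:  ldrb    r3, [r8, r2]             @ start the LUT load for this digit
    cmp     r6, #0                   @ is this the last digit?
    udivne  r0, r6, r7               @ if not, already divide for the next one...
    mlsne   r2, r0, r7, r6           @ ...and compute its remainder
    movne   r6, r0
    strb    r3, [r5], #1             @ store this digit; the ldrb has had time to finish
    bne     1b
Unlike the question's loop, this always emits at least one digit (a lone '0' for a zero input), and it still produces digits LSD-first, so the reversal issue mentioned at the end still applies.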
There can be a tradeoff between code-cache footprint vs. data-cache footprint, too, depending on alignment of surrounding functions. If this runs infrequently, it will likely miss in data cache when it does run, and then the ldr will be a lot more expensive than normal, especially on a high-end CPU. (High clock frequency means DRAM or even L2 latency is many CPU clocks.)
Also, if there's no other "hot" data in the whole page where the LUT would have been, that's a dTLB entry that doesn't need to stay hot, or doesn't need to evict another one.
Footnote 1: Most int->string conversions will use base 10 or 16, which are worth checking for specifically if you care about performance, so you can avoid udiv. (Multiplicative inverse for base 10.)
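A minimal sketch of what that looks like for base 10 (my register choices, not from any posted code; r6 holds the value as in the question): for any 32-bit unsigned n, n/10 == (n * 0xCCCCCCCD) >> 35, so the high half of a umull plus a shift replaces the udiv:
    ldr     r3, =0xCCCCCCCD          @ fixed-point reciprocal of 10 (2^35 / 10, rounded up)
    umull   r1, r0, r6, r3           @ r0 = high 32 bits of r6 * 0xCCCCCCCD
    lsr     r0, r0, #3               @ r0 = r6 / 10
    add     r2, r0, r0, lsl #2       @ r2 = 5 * quotient
    sub     r2, r6, r2, lsl #1       @ r2 = r6 - 10*quotient = remainder (the digit)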
Base 16 is very special because it's a power of 2, and a whole number of 4-bit groups fills a 32-bit integer. You can even generate digits in MSD-first order, because each hex digit depends only on its own 4 bits, not on any higher bits -- unlike div/mod for arbitrary bases, which only produces digits LSD-first so you have to store them in reverse order. See the x86 asm answer to How to convert a binary integer number to a hex string?. Like SSSE3, ARM NEON also has a byte shuffle you could use to do all the hex digits in parallel, adapting one of the strategies from the x86 SIMD parts of my linked answer.
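A scalar sketch of the MSD-first idea (register choices are mine: r0 = value, r1 = output pointer): each iteration isolates the next 4-bit group from the top down, so digits are stored in forward order with no reversal pass. Note this emits a fixed 8 digits, leading zeros included, unlike the question's variable-width loop:
    mov     r3, #28                  @ bit position of the most significant nibble
1:  lsr     r2, r0, r3               @ shift the current nibble down to bits 3:0
    and     r2, r2, #0xf
    cmp     r2, #10
    addlo   r2, r2, #'0'             @ 0..9  -> '0'..'9'
    addhs   r2, r2, #'a'-10          @ 10..15 -> 'a'..'f' (same constant as 'W')
    strb    r2, [r1], #1             @ store forwards; no reversal needed
    subs    r3, r3, #4
    bge     1b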
Style: 'W' is a weird way to express -10 + 'a'. It's the right number, but the reason why has little to do with it representing capital W, so the semantic meaning is all wrong.
Also, you want ldr r8, =hexdigitadr to get the address in a reg, not load from that label address. Or a manual add with PC if you put the table itself nearby, instead of the address in a literal pool.
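Concretely (a sketch assuming hexdigits is the label on the table itself, as in the question's snippet), those two options would look like:
    ldr     r8, =hexdigits           @ gas pseudo-instruction: address loaded from a literal pool
    adr     r8, hexdigits            @ or, if the table is within pc-relative range: a pc-relative add, no data load at all
adr is the assembler's usual spelling of that manual add with PC.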
And like I alluded to earlier, this generates digits in reverse order, but you're storing them forwards in memory, so they'd need reversing.
Upvotes: 2