Reputation: 84642
When evaluating two alternatives to solve a problem, comparing clock-cycles and latencies, if both evaluate roughly the same, is there a better way to decide which to use?
Example - Converting to Hexstring
An example I was looking at involves converting an integer value to hex for output. There are two basic approaches:
"0123456789abcdef"
using ldr
(roughly 3 clock-cycles); or10
and either add '0'
or 'W'
which involves a cmp
and then two conditional adds (e.g. addlo
and addhs
) which at roughly 1 clock-cycle each is again about 3 clock-cycles.(using the rough latencies from the link in the answer from Instruction execution latencies for A53 -- there apparently isn't a good a53 specific latency reference)
Example - Hex Convert Loop Alternatives
The following is code for a Cortex-A53 (Raspberry Pi 3B):
hexdigits: .asciz "0123456789abcdef"
...
ldr r8, hexdigitadr /* load address for hexdigits */
...
hexcvtloop:
cmp r6, 0 /* separation of digits done? */
beq hexcopy /* copy tmp string to address */
udiv r0, r6, r7 /* divide by base, quotient in r0 */
mls r2, r0, r7, r6 /* mod (remainder) in r2 */
mov r6, r0 /* quotient to value */
/* alternative 1 - ASCII hexdigit lookup */
ldrb r2, [r8, r2] /* hexdigit lookup */
/* alternative 2 - add to obtain ASCII hexdigit */
cmp r2, 10 /* compare digit to 10 */
addlo r2, r2, '0' /* convert to digits '0'-'9' */
addhs r2, r2, 'W' /* convert to 'a'-'f' */
strb r2, [r5], 1 /* store in tmp string */
b hexcvtloop
I understand the clock-cycle figures in that reference don't account for other factors: interrupts, memory speed, cache misses, etc.
If my rough estimates of about 3 clock-cycles each -- for either the hex-digit lookup with ldr, or the cmp, addlo, addhs sequence that adds to the remainder -- are fair, is there another consideration that would decide between the two approaches, or is it basically personal preference at that point?
(I'm not overly concerned with getting a Cortex-A53-specific answer; I'm more interested in whether there are other general ARM metrics I should look to next -- or if it's just "up to you" at this point.)
Upvotes: 0
Views: 162
Reputation: 365727
Hopefully this general-case loop using udiv doesn't run very often, with special-case blocks handling base 10 and 16 (footnote 1).
It's important to consider whether a block is hot or not. If it's not, don't spend much time looking at it, or optimize for size (usually overall size, including data) in ways that don't lose badly on speed. That means don't use the LUT. (Except that if you have a specialized int->hex loop, it could share the same LUT, e.g. loaded as one 16-byte SIMD vector.)
On an in-order CPU like A53, per-instruction latency isn't really separate from throughput, but the amount of software-pipelining you can use to hide it can matter, since the HW won't do it for you. That's especially true for load latency: loading and then storing the result in the very next instruction is something to avoid, since it'll probably stall for about the load-use latency.
You always want to find some work you can do between a load and the first use, if you care about your code ever running on in-order cores. That might mean rearranging the loop. For example, you can probably put the store after preparing the next digit, as in the sketch below. In this slow general case, that's a udiv, which should be plenty slow to hide load-use latency. (High-performance in-order ARM cores still do have memory-level parallelism and scoreboard loads or something.)
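As a rough sketch of that rescheduling (mine, not from the question; it assumes the question's register assignments of r6 = value, r7 = base, r8 = LUT address, r5 = output pointer, plus r3 as a free scratch register), the LUT version could put the next digit's udiv/mls between the ldrb and the strb:
hexcvt_sketch:                       @ value in r6, base in r7 (e.g. 16)
    udiv    r0, r6, r7               @ quotient for the first digit
    mls     r2, r0, r7, r6           @ remainder = first digit's value
    mov     r6, r0
1:  ldrb    r3, [r8, r2]             @ start the LUT load for this digit
    cmp     r6, #0                   @ is this the last digit?
    udivne  r0, r6, r7               @ if not, already divide for the next one...
    mlsne   r2, r0, r7, r6           @ ...and compute its remainder
    movne   r6, r0
    strb    r3, [r5], #1             @ store this digit; the ldrb has had time to finish
    bne     1b
Unlike the question's loop, this always emits at least one digit (a lone '0' for a zero input), and it still produces digits LSD-first, so the reversal issue mentioned at the end still applies.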
There can be a tradeoff between code-cache footprint vs. data-cache footprint, too, depending on alignment of surrounding functions. If this runs infrequently, it will likely miss in data cache when it does run, and then the ldr will be a lot more expensive than normal, especially on a high-end CPU. (High clock frequency means DRAM or even L2 latency is many CPU clocks.)
Also, if there's no other "hot" data in the whole page where the LUT would have been, that's a dTLB entry that doesn't need to stay hot, or doesn't need to evict another one.
Footnote 1: Most int->string conversions will use base 10 or 16, which are worth checking for specifically if you care about performance, so you can avoid udiv. (Multiplicative inverse for base 10.)
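A minimal sketch of what that looks like for base 10 (my register choices, not from any posted code; r6 holds the value as in the question): for any 32-bit unsigned n, n/10 == (n * 0xCCCCCCCD) >> 35, so the high half of a umull plus a shift replaces the udiv:
    ldr     r3, =0xCCCCCCCD          @ fixed-point reciprocal of 10 (2^35 / 10, rounded up)
    umull   r1, r0, r6, r3           @ r0 = high 32 bits of r6 * 0xCCCCCCCD
    lsr     r0, r0, #3               @ r0 = r6 / 10
    add     r2, r0, r0, lsl #2       @ r2 = 5 * quotient
    sub     r2, r6, r2, lsl #1       @ r2 = r6 - 10*quotient = remainder (the digit)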
Base 16 is very special because it's a power of 2, and a whole number of 4-bit groups fills a 32-bit integer. You can even generate digits in MSD-first order, because each hex digit depends only on its own 4 bits, not on any higher bits -- unlike div/mod for arbitrary bases, which only produces digits LSD-first so you have to store them in reverse order. See the x86 asm answer to How to convert a binary integer number to a hex string?. Like SSSE3, ARM NEON also has a byte shuffle you could use to do all the hex digits in parallel, adapting one of the strategies from the x86 SIMD parts of my linked answer.
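A scalar sketch of the MSD-first idea (register choices are mine: r0 = value, r1 = output pointer): each iteration isolates the next 4-bit group from the top down, so digits are stored in forward order with no reversal pass. Note this emits a fixed 8 digits, leading zeros included, unlike the question's variable-width loop:
    mov     r3, #28                  @ bit position of the most significant nibble
1:  lsr     r2, r0, r3               @ shift the current nibble down to bits 3:0
    and     r2, r2, #0xf
    cmp     r2, #10
    addlo   r2, r2, #'0'             @ 0..9  -> '0'..'9'
    addhs   r2, r2, #'a'-10          @ 10..15 -> 'a'..'f' (same constant as 'W')
    strb    r2, [r1], #1             @ store forwards; no reversal needed
    subs    r3, r3, #4
    bge     1b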
Style: 'W' is a weird way to express -10 + 'a'. It's the right number, but the reason why has little to do with it representing capital W, so the semantic meaning is all wrong.
Also, you want ldr r8, =hexdigitadr to get the address in a reg, not load from that label address. Or a manual add with PC if you put the table itself nearby, instead of the address in a literal pool.
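Concretely (a sketch assuming hexdigits is the label on the table itself, as in the question's snippet), those two options would look like:
    ldr     r8, =hexdigits           @ gas pseudo-instruction: address loaded from a literal pool
    adr     r8, hexdigits            @ or, if the table is within pc-relative range: a pc-relative add, no data load at all
adr is the assembler's usual spelling of that manual add with PC.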
And like I alluded to earlier, this generates digits in reverse order, but you're storing them forwards in memory, so they'd need reversing.
Upvotes: 2