Reputation: 21
How would one go about writing a function which would copy a given number of bytes from a given source to a given destination in AARCH64 assembly language? This would basically be like memcpy, but with some additional assumptions.
The best resource I have found so far is this source code for memcpy, but I don't understand which parts are relevant to the specific parameters I was given.
EDIT: So far I've tried converting the following function in C into assembly.
void memcpy80(unsigned char* dest, unsigned char* sourc, unsigned long count)
{
unsigned char* end = sourc + count;
while (sourc < end)
*(dest++) = *(sourc++);
}
I then tried to boil that down into a version that fits the specification like so:
memcpy80:
add x3, x1, x2
loopstart:
cmp x3, x1
//how do I branch to loopend if x1 > x3?
add x0, x0, 1
add x1, x1, 1
ldr x4, [x1]
str x4, [x0]
b loopstart
loopend:
ret
What would be the instruction to use to branch to the end of the loop? And then what else would I need to change to comply with the spec?
This would be optimized on 64-bit ARMv8-a architecture. There's nothing in the spec to say that smaller or larger sizes are more common. There are no benefits to having a code-size smaller than 80B, so going right up to the limit for the sake of greater efficiency would be ideal.
Upvotes: 1
Views: 5110
Reputation: 365237
You definitely don't want to just copy 1 byte at a time. That's ridiculous. You know your problem size is a multiple of 16 bytes, so you should at least use 64-bit long, if not a 128-bit type.
Your asm implementation uses 8-byte loads/stores (good), but only increments the pointer by 1, so you're re-copying the same data 8 times, with overlap and misalignment.
If you don't know (yet) how to structure a loop efficiently in asm, with the conditional branch at the bottom, you should probably write in C and tweak the C + compiler options until you get what you want. This is not a bad way to learn asm; you can see how the compiler-generated code implements your simple loop setup + the loop itself.
This would be optimized on 64-bit ARMv8-a architecture.
There are multiple different implementations of that architecture, each with different microarchitecture and different performance characteristics. Optimizing for an in-order core is vastly different from out-of-order. On an in-order CPU, you have to hide all the latency yourself to keep multiple operations in flight at once. Execution stalls if the data for a store isn't ready. But an out-of-order CPU lets hardware keep multiple iterations of the loop in flight itself, starting the load from the next iteration while the store from this iteration is still waiting for data. (Memory and cache are pipelined, so this is important.)
For example, Cortex-A53 has a dual issue, in-order pipeline, while Cortex-A57 has a 3-way superscalar, deeply out-of-order pipeline. (wikipedia).
I played around with this some on the Godbolt compiler explorer, with AArch64 gcc 5.4 and 6.3.
void memcpy80_simple_optimizations(unsigned long *restrict dst, unsigned long* restrict src, unsigned long len){
    dst = __builtin_assume_aligned(dst, 16);
    src = __builtin_assume_aligned(src, 16);
    unsigned long* end = src + len;  // len is in 8byte units here, but implement however you want
    while (src < end) {
        unsigned long tmp1 = *src;   // do both loads ahead of both stores for better pipelining on in-order chips
        unsigned long tmp2 = src[1]; // gcc seems to do a bad job
        dst[0] = tmp1;               // unroll by 2 because we know len is a multiple of 16 bytes
        dst[1] = tmp2;
        src+=2;
        dst+=2;
    }
}
This compiles with gcc6.3 -O3 -fno-builtin -fno-tree-vectorize (so gcc doesn't recognize the pattern and compile it into a call to memcpy!) to:
add x2, x1, x2, lsl 3 # leave out the lsl 3 for count in bytes
cmp x1, x2
bcs .L8 # skip first iteration for small size
.L12:
ldr x3, [x1, 8] # x3 = memory[x1+8]
add x0, x0, 16
ldr x4, [x1], 16 # x4 = memory[x1], x1+=16 (post-increment)
cmp x2, x1
stp x4, x3, [x0, -16] # store pair
bhi .L12
.L8:
ret
This is probably a good starting point for the loop structure; it's only 10 instructions (and thus 40 bytes). It could be made more efficient: using ldp to load 2 registers would probably be best, like how it uses stp to store both x4 and x3 with one instruction. There might be room in 80 bytes to unroll by 32 if the startup code figures out where to jump into the loop based on count & 16 (i.e. whether to copy an odd or even number of 16-byte chunks). But that might interfere with doing both loads before both stores.
You could probably also reduce loop overhead with a more clever choice of addressing modes.
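For example, here's a rough hand-written sketch of that idea (mine, untested, not compiler output) combining ldp/stp with post-index addressing modes so the pointer increments are folded into the memory accesses. It keeps the register assignment from the compiler output above (x0 = dst, x1 = src, x2 = count) and assumes count is in bytes and a non-zero multiple of 16:

memcpy80_ldp:                  // hypothetical name, not from the spec
    add   x2, x1, x2           // x2 = src + count = one past the end of src
loop:
    ldp   x3, x4, [x1], 16     // load 16 bytes, then src += 16 (post-index)
    stp   x3, x4, [x0], 16     // store 16 bytes, then dst += 16 (post-index)
    cmp   x1, x2
    b.lo  loop                 // keep going while src < end (unsigned)
    ret

That's only 7 instructions (28 bytes), leaving room under the 80-byte limit for more unrolling or for software pipelining like the glibc loop below.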
I think it's more optimal to do both loads and then both stores. With *dst++ = *src++ twice, I was getting load/store, load/store. I don't know if gcc was ignoring the restrict qualifier on the pointers (which tells the compiler they don't overlap), or if it's somehow better to alternate loads and stores on AArch64 instead of doing multiple loads and then multiple stores. There was a bit of discussion on an LLVM patch proposal about how that compiler should inline memcpy for small fixed-size copies, and there was some suggestion that ldr/str / ldp/stp was maybe better, but maybe only compared to one where the stp stores data from an ldr and the first half of an ldp. I'm not sure what they were implying.
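To make the two orderings concrete (this is just my illustration of the general idea, not code from that LLVM discussion), a fixed-size 32-byte copy with x0 = dst and x1 = src could be scheduled either way:

    // both loads ahead of both stores (what the C above is aiming for):
    ldp   x2, x3, [x1]
    ldp   x4, x5, [x1, 16]
    stp   x2, x3, [x0]
    stp   x4, x5, [x0, 16]

    // vs. alternating each load with the store that consumes it:
    ldp   x2, x3, [x1]
    stp   x2, x3, [x0]
    ldp   x4, x5, [x1, 16]
    stp   x4, x5, [x0, 16]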
glibc's AArch64 memcpy uses this inner loop for large copies. dst, src, and count (and A_l and so on) are preprocessor macros for x registers.
L(loop64):
stp A_l, A_h, [dst, 16]
ldp A_l, A_h, [src, 16]
stp B_l, B_h, [dst, 32]
ldp B_l, B_h, [src, 32]
stp C_l, C_h, [dst, 48]
ldp C_l, C_h, [src, 48]
stp D_l, D_h, [dst, 64]!
ldp D_l, D_h, [src, 64]!
subs count, count, 64
b.hi L(loop64)
Note that each store instruction is storing data loaded in the previous iteration, so the distance from each load to the store that consumes it is all the way around the loop, minus 1 instruction. This is why it starts with a store and ends with a load. The code before the loop loads 4 pairs of x registers, and the code after it stores what the loop left in registers.
You could adapt this technique down to 2 pairs.
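Something like this untested sketch (mine, not glibc's code) shows the shape of a 2-pair version. As in the glibc loop, A_l/A_h, B_l/B_h, dst, src, and count are stand-ins for x registers, and it assumes count is a multiple of 32 and at least 32:

    /* Prologue: prime the software pipeline with the first 32 bytes.  */
    ldp  A_l, A_h, [src]
    ldp  B_l, B_h, [src, 16]
    add  src, src, 32
    subs count, count, 32
    b.ls L(last32)              /* only one 32-byte chunk: nothing more to load */
L(loop32):
    stp  A_l, A_h, [dst]        /* store data loaded last iteration (or in the prologue)... */
    ldp  A_l, A_h, [src]        /* ...while loading the next chunk */
    stp  B_l, B_h, [dst, 16]
    ldp  B_l, B_h, [src, 16]
    add  src, src, 32
    add  dst, dst, 32
    subs count, count, 32
    b.hi L(loop32)
L(last32):
    /* Epilogue: store whatever the final loads left in registers.  */
    stp  A_l, A_h, [dst]
    stp  B_l, B_h, [dst, 16]

glibc avoids the separate add instructions by using writeback ([dst, 64]! / [src, 64]!) on the last pair of each iteration; the explicit adds here are just to keep the sketch easy to follow.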
But anyway, clearly whoever wrote / tuned the glibc implementation thinks that AArch64 typically benefits from software pipelining, rather than from simply storing data you just loaded.
This is backed up by testing on Cortex-A53 and A57, according to mailing list comments from the author, reporting speedups over the previous version.
Upvotes: 7