Quadratic Ocelot

Reputation: 21

Using AARCH64 assembly to write a memcpy

How would one go about writing a function which would copy a given number of bytes from a given source to a given destination in AARCH64 assembly language? This would basically be like memcpy, but with some additional assumptions.

  1. The source and destination are aligned on 16 byte boundaries
  2. The memory regions do not overlap
  3. The count is a multiple of 16
  4. The instructions must fit within 80 bytes

The best resource I have found so far is this source code for memcpy, but I don't understand which parts are relevant to the specific parameters I was given.

EDIT: So far I've tried converting the following function in C into assembly.

void memcpy80(unsigned char* dest, unsigned char* sourc, unsigned long count)
{
    unsigned char* end = sourc + count;
    while (sourc < end)
        *(dest++) = *(sourc++);
}

I then tried to boil that down into a version that fits the specification like so:

memcpy80:
    add x3, x1, x2
loopstart:
    cmp x3, x1
    //how do I branch to loopend if x1 > x3?
    add x0, x0, 1
    add x1, x1, 1
    ldr x4, [x1]
    str x4, [x0]
    b loopstart
loopend:
    ret

What would be the instruction to use to branch to the end of the loop? And then what else would I need to change to comply with the spec?

This would be optimized on 64-bit ARMv8-a architecture. There's nothing in the spec to say that smaller or larger sizes are more common. There are no benefits to having a code-size smaller than 80B, so going right up to the limit for the sake of greater efficiency would be ideal.

Upvotes: 1

Views: 5110

Answers (1)

Peter Cordes

Reputation: 365237

You definitely don't want to just copy 1 byte at a time. That's ridiculous. You know your problem size is a multiple of 16 bytes, so you should at least use 64-bit long, if not a 128-bit type.

Your asm implementation uses 8-byte loads/stores (good), but only increments the pointer by 1, so you're re-copying the same data 8 times, with overlap and misalignment.

If you don't know (yet) how to structure a loop efficiently in asm, with the conditional branch at the bottom, you should probably write in C and tweak the C + compiler options until you get what you want. This is not a bad way to learn asm; you can see how the compiler-generated code implements your simple loop setup + the loop itself.
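
For reference, a minimal, untuned sketch of that loop structure (with the compare and conditional branch at the bottom) might look like the following; x0/x1/x2 are the dest/src/count argument registers, the count is assumed to be a multiple of 16, and b.lo (branch if unsigned lower) is the condition that corresponds to the C loop's src < end. The label names are just for illustration.

memcpy80_sketch:
    cbz     x2, done            // count == 0: nothing to copy
    add     x2, x1, x2          // x2 = one-past-the-end of the source
copyloop:
    ldr     x3, [x1], 8         // load 8 bytes, then src += 8 (post-index)
    ldr     x4, [x1], 8
    str     x3, [x0], 8         // store 8 bytes, then dst += 8
    str     x4, [x0], 8
    cmp     x1, x2
    b.lo    copyloop            // keep looping while src < end (unsigned lower)
done:
    ret

That's only 9 instructions (36 bytes), but it keeps just one 16-byte chunk in flight per iteration, so there's room within the 80-byte budget to do better.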

This would be optimized on 64-bit ARMv8-a architecture.

There are multiple different implementations of that architecture, each with different microarchitecture and different performance characteristics. Optimizing for an in-order core is vastly different from out-of-order. On an in-order CPU, you have to hide all the latency yourself to keep multiple operations in flight at once. Execution stalls if the data for a store isn't ready. But an out-of-order CPU lets hardware keep multiple iterations of the loop in flight itself, starting the load from the next iteration while the store from this iteration is still waiting for data. (Memory and cache are pipelined, so this is important.)

For example, Cortex-A53 has a dual-issue, in-order pipeline, while Cortex-A57 has a 3-way superscalar, deeply out-of-order pipeline (wikipedia).


I played around with this some on the Godbolt compiler explorer, with AArch64 gcc 5.4 and 6.3.

void memcpy80_simple_optimizations(unsigned long *restrict dst, unsigned long* restrict src, unsigned long len){
    dst = __builtin_assume_aligned(dst, 16);
    src = __builtin_assume_aligned(src, 16);
    unsigned long* end = src + len;  // len is in 8byte units here, but implement however you want
    while (src < end) {
        unsigned long tmp1 = *src;    // do both loads ahead of both stores for better pipelining on in-order chips
        unsigned long tmp2 = src[1];  // gcc seems to do a bad job  
        dst[0] = tmp1;  // unroll by 2 because we know len is a multiple of 16 bytes
        dst[1] = tmp2;
        src+=2;
        dst+=2;
    }
}

With gcc6.3 -O3 -fno-builtin -fno-tree-vectorize (so gcc doesn't recognize the pattern and compile it into a call to memcpy!), this compiles to:

    add     x2, x1, x2, lsl 3    # leave out the lsl 3 for count in bytes
    cmp     x1, x2
    bcs     .L8                  # skip first iteration for small size
.L12:
    ldr     x3, [x1, 8]          # x3 = memory[x1+8]
    add     x0, x0, 16
    ldr     x4, [x1], 16         # x4 = memory[x1], x1+=16 (post-increment)
    cmp     x2, x1
    stp     x4, x3, [x0, -16]    # store pair
    bhi     .L12
.L8:
    ret

This is probably a good starting point for loop structure; only 10 instructions (and thus 40 bytes). It could also be more efficient: using ldp to load 2 registers would probably be best, just like it already uses stp to store both x4 and x3 with one instruction. There might be room in 80 bytes to unroll by 32 if the startup code figures out where to jump into the loop based on count & 16 (whether to copy an odd or even number of 16-byte chunks). But that might interfere with doing both loads before both stores.

You could probably also reduce loop overhead with a more clever choice of addressing modes.
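
For example (a sketch, not compiler output, assuming the count in x2 is a multiple of 16), using ldp for the loads and post-increment addressing on both pointers gets the whole loop down to four instructions:

memcpy80_ldp:
    cbz     x2, done             // count == 0: nothing to copy
loop:
    ldp     x3, x4, [x1], 16     // load 16 bytes, then src += 16 (post-index)
    subs    x2, x2, 16           // count -= 16, setting flags
    stp     x3, x4, [x0], 16     // store 16 bytes, then dst += 16
    b.hi    loop                 // loop while count is still > 0
done:
    ret

That's 28 bytes, leaving most of the 80-byte budget free for unrolling and/or software pipelining.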

I think it's better to do both loads and then both stores. With *dst++ = *src++ twice, I was getting load/store, load/store. I don't know if gcc was ignoring the restrict qualifier on the pointers (which tells it they don't overlap), or if it's somehow better to alternate loads and stores on AArch64 instead of doing multiple loads and then multiple stores. There was a bit of discussion on an LLVM patch proposal about how that compiler should inline memcpy for small fixed-size copies, and there was some suggestion that ldr/str / ldp/stp was maybe better, but maybe only compared to one where the stp stores data from an ldr and the first half of an ldp. I'm not sure what they were implying.
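
Concretely, the two orderings in question look something like this for a fixed 32-byte block (hypothetical fragments, not taken from that LLVM discussion):

    // both loads ahead of both stores:
    ldp     x3, x4, [x1]
    ldp     x5, x6, [x1, 16]
    stp     x3, x4, [x0]
    stp     x5, x6, [x0, 16]

    // vs. alternating load/store:
    ldp     x3, x4, [x1]
    stp     x3, x4, [x0]
    ldp     x5, x6, [x1, 16]
    stp     x5, x6, [x0, 16]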

glibc's AArch64 memcpy uses this inner loop for large copies. dst, src, and count (and A_l and so on) are preprocessor macros for x registers.

L(loop64):
    stp        A_l, A_h, [dst, 16]
    ldp        A_l, A_h, [src, 16]
    stp        B_l, B_h, [dst, 32]
    ldp        B_l, B_h, [src, 32]
    stp        C_l, C_h, [dst, 48]
    ldp        C_l, C_h, [src, 48]
    stp        D_l, D_h, [dst, 64]!
    ldp        D_l, D_h, [src, 64]!
    subs        count, count, 64
    b.hi        L(loop64)

Note that each store instruction is storing data loaded in the previous iteration, so the distance from each load to each store is all the way around the loop, minus 1 instruction. This is why it starts with a store, and ends with a load. The code before the loop loads 4 pairs of x registers, and the code after it stores what the loop left in registers.

You could adapt this technique down to 2 pairs.
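
As a rough sketch of that (mine, untested, not from glibc): using one ldp/stp pair per iteration keeps the leftover-chunk handling trivial for any multiple of 16, and the same structure extends to 2 pairs. Register and label choices here are just for illustration.

memcpy80_pipelined:
    cbz     x2, done             // count == 0: nothing to copy
    ldp     x3, x4, [x1]         // prologue: load the first 16 bytes
    subs    x2, x2, 16
    b.eq    last                 // exactly one chunk: skip the loop
loop:
    stp     x3, x4, [x0], 16     // store the chunk loaded on the previous iteration
    ldp     x3, x4, [x1, 16]!    // pre-indexed load of the next chunk
    subs    x2, x2, 16
    b.hi    loop
last:
    stp     x3, x4, [x0]         // epilogue: store the final chunk
done:
    ret

Each stp still consumes data loaded on the previous trip around the loop, and the whole thing is 10 instructions (40 bytes), well within the limit.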

But anyway, clearly whoever wrote / tuned the glibc implementation thinks that AArch64 typically benefits from software pipelining, the opposite of storing data you just loaded.

This is backed up by testing on Cortex-A53 and A57, according to mailing list comments from the author, reporting speedups over the previous version.

Upvotes: 7
