Reputation: 395

How can I implement a fast copying function like memcpy()?

I've seen a few answers on how memcpy() can achieve faster speed than a naive byte-by-byte copy. Most of them suggest something along the lines of:

void *my_memcpy(void *dest, const void *src, size_t n) {
    uint64_t *d = dest;
    const uint64_t *s = src;
    n /= sizeof(uint64_t);

    while (n--)
        *d++ = *s++;

    return dest;
}

which to my understanding (correct me if I'm wrong) can violate the strict aliasing assumption and cause undefined behavior. To keep it simple assume that n and the alignment and size of src and dest are multiples of 8.

If my_memcpy can indeed cause undefined behavior, I want to know how memcpy manages to copy multiple bytes at a time without violating any compiler assumptions. An example of any working implementation for x64 would help.

Suggestions to use the library routine won't work. I'm actually not writing my own memcpy. I'm writing a function that can use a similar optimization but AFAIK is unavailable in the C standard.

Upvotes: 1

Answers (4)

supercat

Reputation: 81227

Efficiently exploiting the features of a particular target architecture will often require the use of non-portable code, but the authors of the Standard have expressly recognized that:

C code can be non-portable. [emphasis original] Although it strove to give programmers the opportunity to write truly portable programs, the C89 Committee did not want to force programmers into writing portably, to preclude the use of C as a “high-level assembler”: the ability to write machine-specific code is one of the strengths of C. It is this principle which largely motivates drawing the distinction between strictly conforming program and conforming program (§4).

Chunking optimizations require the use of a popular extension nearly all implementations can be configured to support. Using the -fno-strict-aliasing flag to enable this extension in gcc and clang may yield poor performance unless code uses the restrict qualifier when appropriate, but that should be blamed on the failure to properly use restrict. The performance penalty of -fno-strict-aliasing is small in code that properly uses restrict, while failure to use restrict will often impose a significant performance penalty even without -fno-strict-aliasing.

Upvotes: 0

Antti Haapala -- Слава Україні

Reputation: 133988

Tstenner already detailed the most important points.

But I'll add this: if you're coding in C and your compiler is cleverer than you, it will notice that you wrote a bad version of memcpy and will replace it by a call to the actual builtin memcpy. For example this:

#include <stdlib.h>

void *mymemcpy(void *restrict dest, const void * restrict src, size_t n) {
   char *csrc = (char *)src; 
   char *cdest = (char *)dest; 

   for (size_t i=0; i<n; i++) 
       cdest[i] = csrc[i]; 

   return dest;
}

Compile with GCC 9.1 and the resulting assembly is

mymemcpy:
        test    rdx, rdx
        je      .L7
        sub     rsp, 8
        call    memcpy
        add     rsp, 8
        ret
.L7:
        mov     rax, rdi
        ret

That, given that you're not trying to be too clever...

Upvotes: 0

Lundin

Reputation: 214385

Portably, you should copy on alignment basis, which isn't necessarily uint64_t. In theory, you should be using uint_fast8_t but in practice that one is apparently 1 byte big, 1 byte aligned on most systems. If portability isn't required, you can stick with uint64_t.

The next problem is that the pointers passed to memcpy are not necessarily pointing at an aligned address, as per the requirement of the standard function to work regardless of alignment. So you'll have to do something like this:

size_t prealign = (uintptr_t)src % _Alignof(uint64_t);
if(prealign != 0)
{
  // copy bytes up to next aligned address
}

Same for destination, and same for the end of the data.

which to my understanding (correct me if I'm wrong) can violate the strict aliasing assumption and cause undefined behavior.

Correct. So in order to copy uint64_t chunks you either have to write the code in inline assembler, or you have to disable strict aliasing in a non-standard way when compiling, such as gcc -fno-strict-aliasing.

The "real" library memcpy is treated as a special case by the compiler, as are many other such library functions. memcpy(&foo, &bar, sizeof(int)); will for example get translated to a single mov instruction, inlined in the caller code, without memcpy being called at all.

Another note about pointer aliasing is that you should restrict qualify the pointers as done with the real memcpy. This tells the compiler that it can assume that the dest and src pointers aren't the same, or that they overlap, meaning that the compiler doesn't need to add checks or overhead code for that scenario.

Amusingly, when I write the following naive copy function:

#include <stdint.h>
#include <stddef.h>

void foocpy (void* dst, const void* src, size_t n)
{
  uint8_t* u8_dst = dst;
  const uint8_t* u8_src = src;

  for(size_t i=0; i<n; i++)
  {
    u8_dst[i] = u8_src[i];
  }
}

Then the compiler gives me a tonne of fairly inefficient machine code. But if I simply add restrict to both pointers, the whole functions gets replaced with this:

foocpy:
        test    rdx, rdx
        je      .L1
        jmp     memcpy
.L1:
        ret

This again shows that the built-in memcpy is a treated as a special snowflake by the compiler.

Upvotes: 4

tstenner

Reputation: 10301

memcpy is a special function that the compiler can replace with a builtin version, e.g. if it can prove that two arrays don't overlap.

The actual, fast implementations almost always use assembler and special intrinsics (e.g. glibc with SSSE3), but other libc implementations might implement it in C (e.g. musl).

Upvotes: 4

How can I implement a fast copying function like memcpy()?

Answers (4)

Related Questions