Reputation: 395
I've seen a few answers on how memcpy()
can achieve faster speed than a naive byte-by-byte copy. Most of them suggest something along the lines of:
void *my_memcpy(void *dest, const void *src, size_t n) {
uint64_t *d = dest;
const uint64_t *s = src;
n /= sizeof(uint64_t);
while (n--)
*d++ = *s++;
return dest;
}
which to my understanding (correct me if I'm wrong) can violate the strict aliasing assumption and cause undefined behavior. To keep it simple assume that n
and the alignment and size of src
and dest
are multiples of 8.
If my_memcpy
can indeed cause undefined behavior, I want to know how memcpy
manages to copy multiple bytes at a time without violating any compiler assumptions. An example of any working implementation for x64 would help.
Suggestions to use the library routine won't work. I'm actually not writing my own memcpy
. I'm writing a function that can use a similar optimization but AFAIK is unavailable in the C standard.
Upvotes: 1
Views: 2132
Reputation: 81227
Efficiently exploiting the features of a particular target architecture will often require the use of non-portable code, but the authors of the Standard have expressly recognized that:
C code can be non-portable. [emphasis original] Although it strove to give programmers the opportunity to write truly portable programs, the C89 Committee did not want to force programmers into writing portably, to preclude the use of C as a “high-level assembler”: the ability to write machine-specific code is one of the strengths of C. It is this principle which largely motivates drawing the distinction between strictly conforming program and conforming program (§4).
Chunking optimizations require the use of a popular extension nearly all implementations can be configured to support. Using the -fno-strict-aliasing
flag to enable this extension in gcc and clang may yield poor performance unless code uses the restrict
qualifier when appropriate, but that should be blamed on the failure to properly use restrict
. The performance penalty of -fno-strict-aliasing
is small in code that properly uses restrict
, while failure to use restrict
will often impose a significant performance penalty even without -fno-strict-aliasing
.
Upvotes: 0
Reputation: 133988
Tstenner already detailed the most important points.
But I'll add this: if you're coding in C and your compiler is cleverer than you, it will notice that you wrote a bad version of memcpy
and will replace it by a call to the actual builtin memcpy
. For example this:
#include <stdlib.h>
void *mymemcpy(void *restrict dest, const void * restrict src, size_t n) {
char *csrc = (char *)src;
char *cdest = (char *)dest;
for (size_t i=0; i<n; i++)
cdest[i] = csrc[i];
return dest;
}
Compile with GCC 9.1 and the resulting assembly is
mymemcpy:
test rdx, rdx
je .L7
sub rsp, 8
call memcpy
add rsp, 8
ret
.L7:
mov rax, rdi
ret
That, given that you're not trying to be too clever...
Upvotes: 0
Reputation: 214385
Portably, you should copy on alignment basis, which isn't necessarily uint64_t
. In theory, you should be using uint_fast8_t
but in practice that one is apparently 1 byte big, 1 byte aligned on most systems. If portability isn't required, you can stick with uint64_t
.
The next problem is that the pointers passed to memcpy
are not necessarily pointing at an aligned address, as per the requirement of the standard function to work regardless of alignment. So you'll have to do something like this:
size_t prealign = (uintptr_t)src % _Alignof(uint64_t);
if(prealign != 0)
{
// copy bytes up to next aligned address
}
Same for destination, and same for the end of the data.
which to my understanding (correct me if I'm wrong) can violate the strict aliasing assumption and cause undefined behavior.
Correct. So in order to copy uint64_t
chunks you either have to write the code in inline assembler, or you have to disable strict aliasing in a non-standard way when compiling, such as gcc -fno-strict-aliasing
.
The "real" library memcpy is treated as a special case by the compiler, as are many other such library functions. memcpy(&foo, &bar, sizeof(int));
will for example get translated to a single mov
instruction, inlined in the caller code, without memcpy
being called at all.
Another note about pointer aliasing is that you should restrict
qualify the pointers as done with the real memcpy. This tells the compiler that it can assume that the dest
and src
pointers aren't the same, or that they overlap, meaning that the compiler doesn't need to add checks or overhead code for that scenario.
Amusingly, when I write the following naive copy function:
#include <stdint.h>
#include <stddef.h>
void foocpy (void* dst, const void* src, size_t n)
{
uint8_t* u8_dst = dst;
const uint8_t* u8_src = src;
for(size_t i=0; i<n; i++)
{
u8_dst[i] = u8_src[i];
}
}
Then the compiler gives me a tonne of fairly inefficient machine code. But if I simply add restrict
to both pointers, the whole functions gets replaced with this:
foocpy:
test rdx, rdx
je .L1
jmp memcpy
.L1:
ret
This again shows that the built-in memcpy
is a treated as a special snowflake by the compiler.
Upvotes: 4
Reputation: 10301
memcpy
is a special function that the compiler can replace with a builtin version, e.g. if it can prove that two arrays don't overlap.
The actual, fast implementations almost always use assembler and special intrinsics (e.g. glibc with SSSE3), but other libc implementations might implement it in C (e.g. musl).
Upvotes: 4