Reputation: 189
Below is the code which I got while searching for an optimized memcpy
implementation.
Here's the link
void *memcpy(void *dst, void const *src, size_t len) {
long *plDst = (long *)dst;
long const *plSrc = (long const *)src;
if (!(src & 0xFFFFFFFC) && !(dst & 0xFFFFFFFC)) {
while (len >= 4) {
*plDst++ = *plSrc++;
len -= 4;
}
}
char *pcDst = (char *)plDst;
char const *pcDst = (char const *)plSrc;
while (len--) {
*pcDst++ = *pcSrc++;
}
return (dst);
}
Can somebody explain the line below to me?
if (!(src & 0xFFFFFFFC) && !(dst & 0xFFFFFFFC))
Here they want to check if src
and dst
address are aligned to a 4 byte
boundary or not. Why are they using !
though, as it will make the condition false
every time?
Secondly, is there any scope for more optimization in the above code?
Upvotes: 5
Views: 2928
Reputation: 144520
This article, while addressing an interesting subject, falls short of providing correct examples. The code posted is referred to as GNU's newlib source code. Both the GNU project and the newlib team will be surprised to learn of this unexpected convergence statement! the newlib is not a GNU project and most of its source code is not GPL licenced.
This optimized implementation of memcpy
is non portable, sub-optimal and in many aspects incorrect.
The test if (!(src & 0xFFFFFFFC) && !(dst & 0xFFFFFFFC))
attempts to detect if the src
and dst
addresses are aligned on long
boundaries. It is cumbersome and non portable for multiple reasons and downright incorrect as you noticed:
void *
to int
is ugly and implementation defined. The pointers should be cast as (uintptr_t)
for better portability.0xFFFFFFFC
assumes type long
is 4 bytes. This may be incorrect, and indeed type long
is 8-byte long on 64-bit linux and Mac systems.src & 0xFFFFFFFC
is not a check for alignment and is unlikely to be 0
, the intended test for alignment on 4-byte boundaries is src & 3
.Furthermore, the code fails to optimize the case where src
and dst
have the same alignment but are not aligned on long
boundaries.
Other possible improvements include unrolling the loops, using a switch for small values of len
, combining bytes read from src
into long
s to write to dst
once its is aligned on long
boundaries...
Here is an improved alternative:
#include <stdint.h>
void *memcpy(void *dst, void const *src, size_t len) {
unsigned char *pcDst = (unsigned char *)dst;
unsigned char const *pcSrc = (unsigned char const *)src;
if (len >= sizeof(long) * 2
&& ((uintptr_t)src & (sizeof(long) - 1)) == ((uintptr_t)dst & (sizeof(long) - 1))) {
while (((uintptr_t)pcSrc & (sizeof(long) - 1)) != 0) {
*pcDst++ = *pcSrc++;
len--;
}
long *plDst = (long *)pcDst;
long const *plSrc = (long const *)pcSrc;
/* manually unroll the loop */
while (len >= sizeof(long) * 4) {
plDst[0] = plSrc[0];
plDst[1] = plSrc[1];
plDst[2] = plSrc[2];
plDst[3] = plSrc[3];
plSrc += 4;
plDst += 4;
len -= sizeof(long) * 4;
}
while (len >= sizeof(long)) {
*plDst++ = *plSrc++;
len -= sizeof(long);
}
pcDst = (unsigned char *)plDst;
pcSrc = (unsigned char const *)plSrc;
}
while (len--) {
*pcDst++ = *pcSrc++;
}
return dst;
}
Note that the casts from void *
are unnecessary in C, but required in C++.
Here are some important points to keep in mind when trying to optimize code for speed:
memcpy
is usually optimized in assembly or implemented as a built-in by modern compilers.Upvotes: 4