Reputation: 189

Optimization in Memcpy implementation

Below is the code which I got while searching for an optimized memcpy implementation.
Here's the link

void *memcpy(void *dst, void const *src, size_t len) {
    long *plDst = (long *)dst;
    long const *plSrc = (long const *)src;

    if (!(src & 0xFFFFFFFC) && !(dst & 0xFFFFFFFC)) {
        while (len >= 4) {
            *plDst++ = *plSrc++;
            len -= 4;
        }
    }

    char *pcDst = (char *)plDst;
    char const *pcDst = (char const *)plSrc;

    while (len--) {
         *pcDst++ = *pcSrc++;
    }

    return (dst);
}

Can somebody explain the line below to me?

if (!(src & 0xFFFFFFFC) && !(dst & 0xFFFFFFFC))

Here they want to check if src and dst address are aligned to a 4 byte boundary or not. Why are they using ! though, as it will make the condition false every time?

Secondly, is there any scope for more optimization in the above code?

Upvotes: 5

Answers (1)

chqrlie

Reputation: 144520

This article, while addressing an interesting subject, falls short of providing correct examples. The code posted is referred to as GNU's newlib source code. Both the GNU project and the newlib team will be surprised to learn of this unexpected convergence statement! the newlib is not a GNU project and most of its source code is not GPL licenced.

This optimized implementation of memcpy is non portable, sub-optimal and in many aspects incorrect.

The test if (!(src & 0xFFFFFFFC) && !(dst & 0xFFFFFFFC)) attempts to detect if the src and dst addresses are aligned on long boundaries. It is cumbersome and non portable for multiple reasons and downright incorrect as you noticed:

implicit conversion from void * to int is ugly and implementation defined. The pointers should be cast as (uintptr_t) for better portability.
0xFFFFFFFC assumes type long is 4 bytes. This may be incorrect, and indeed type long is 8-byte long on 64-bit linux and Mac systems.
src & 0xFFFFFFFC is not a check for alignment and is unlikely to be 0, the intended test for alignment on 4-byte boundaries is src & 3.

Furthermore, the code fails to optimize the case where src and dst have the same alignment but are not aligned on long boundaries.

Other possible improvements include unrolling the loops, using a switch for small values of len, combining bytes read from src into longs to write to dst once its is aligned on long boundaries...

Here is an improved alternative:

#include <stdint.h>

void *memcpy(void *dst, void const *src, size_t len) {
    unsigned char *pcDst = (unsigned char *)dst;
    unsigned char const *pcSrc = (unsigned char const *)src;

    if (len >= sizeof(long) * 2
    &&  ((uintptr_t)src & (sizeof(long) - 1)) == ((uintptr_t)dst & (sizeof(long) - 1))) {
        while (((uintptr_t)pcSrc & (sizeof(long) - 1)) != 0) {
            *pcDst++ = *pcSrc++;
            len--;
        }
        long *plDst = (long *)pcDst;
        long const *plSrc = (long const *)pcSrc;
        /* manually unroll the loop */
        while (len >= sizeof(long) * 4) {
            plDst[0] = plSrc[0];
            plDst[1] = plSrc[1];
            plDst[2] = plSrc[2];
            plDst[3] = plSrc[3];
            plSrc += 4;
            plDst += 4;
            len -= sizeof(long) * 4;
        }
        while (len >= sizeof(long)) {
            *plDst++ = *plSrc++;
            len -= sizeof(long);
        }
        pcDst = (unsigned char *)plDst;
        pcSrc = (unsigned char const *)plSrc;
    }
    while (len--) {
         *pcDst++ = *pcSrc++;
    }
    return dst;
}

Note that the casts from void * are unnecessary in C, but required in C++.

Here are some important points to keep in mind when trying to optimize code for speed:

no trade-offs on correctness. Fast code that fails even on border cases is useless.
beware of portability issues: solutions that depend on type sizes and/or byte order may create portability problems.
benchmarking will tell what works and what does not: you must carefully time the various alternatives on a variety of test cases.
there is no perfect solution: faster code on one architecture can be slower on a different one including and most problematically on a future one. Benchmark on different architectures.
fight complexity: avoid convoluted solutions that do not bring substantial improvements, their correctness is more difficult to prove and to maintain.
memcpy is usually optimized in assembly or implemented as a built-in by modern compilers.

Upvotes: 4

Optimization in Memcpy implementation

Answers (1)

Related Questions