adrianton3

Reputation: 2358

What is the fastest way to add 16 small numbers

I have two arrays, a and b. Each contains 16 bytes, and I would like to add each b[i] to its corresponding a[i]. The arrays do not overlap, and I also know that each resulting sum always fits in a byte (important!).

void add16_reference (uint8_t *as, uint8_t *bs) {
   for (auto i = 0; i < 16; i++) {
     as[i] += bs[i];
   }
}

I tried reimplementing this function in a bunch of ways, and the best I could come up with was

typedef unsigned __int128 uint128_t;

void add16_v3 (uint8_t *as, uint8_t *bs) {
    uint128_t a, b, s;

    std::memcpy(&a, as, 16);
    std::memcpy(&b, bs, 16);

    s = a + b;
    std::memcpy(as, &s, 16);
}

Both GCC and Clang will happily compile this to 2 movs and 2 adds, which is great, but I can't help wondering whether there are faster ways I'm just not aware of.

I reasoned that I can use a single wide addition because the individual sums always fit in a byte, so no carry ever propagates from one byte into the next.

I've used godbolt (bless them) to inspect the resulting code: https://godbolt.org/z/h4adMTnKn. I can see that sometimes the compilers emit SIMD (?) instructions and sometimes not. This matters because in the benchmarks plain old movs/adds are ~8 times faster: https://quick-bench.com/q/z0374AXew8_eL8eDoQXn9XlJm9g. It also seems to me that there's no sure way to convince the compilers to optimize one way or the other (note that v1 compiles to movs/adds on godbolt while on quick-bench it uses SIMD instead).

Other things I tried: using reinterpret_cast instead of memcpy, but that generates SIMD-looking instructions no matter how I shuffle it around; also, just adding restrict to as and bs (in C this time) in the reference implementation seems to always emit SIMD (but I guess that's normal).

With all of the above in mind: is there a faster/nicer/more predictable way of adding these numbers together?

EDIT: Thanks @Jan for pointing out that the benchmarks are incorrect. Perhaps this https://quick-bench.com/q/EMyFsTi7w8wKx4lO_EPQg_FvVb0 updated version is correct

Upvotes: 5

Views: 300

Answers (4)

Peter Cordes

Reputation: 365267

With uint8_t *__restrict as for the first arg, your add16_reference auto-vectorizes to unaligned loads/stores around a paddb xmm1, xmm0 instruction. (Godbolt). Or with AVX, it can use an unaligned memory source operand.
paddb is a packed SIMD add of 16x u8 elements, without carry propagation between elements, so it exactly implements the uint8_t wrapping semantics of your original C++ source. On some old CPUs, paddb is a bit more efficient than paddq (2x 64-bit elements); see https://uops.info/ and https://agner.org/optimize/.
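
For reference, a minimal sketch of the restrict-qualified version described above; it's just the question's reference loop with __restrict on the destination argument (the name add16_restrict is only for illustration):

#include <cstdint>

void add16_restrict (uint8_t *__restrict as, uint8_t *bs) {
    // With the no-overlap promise, GCC and Clang vectorize this loop into
    // unaligned loads/stores around a single paddb (per the Godbolt link above).
    for (auto i = 0; i < 16; i++) {
        as[i] += bs[i];
    }
}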

When inlining for larger loops, a compiler would hopefully spend the extra instructions to check for overlap between source and destination, and use a vectorized loop if it didn't overlap. Unless you're doing lots of small blocks, in which case it might decide it wasn't profitable to do the checks.


If some of the build targets you care about don't have SIMD at all, then yeah it might be worth futzing with memcpy for SWAR in a uint64_t or uint32_t (for 32-bit ARM Cortex-M microcontrollers). All(?) mainstream 64-bit ISAs these days except RISC-V64 have SIMD instructions as baseline. (x86-64 has SSE2, AArch64 has NEON/ASIMD).
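
A minimal sketch of that SWAR fallback (the name add16_swar is made up for illustration; it relies on the question's guarantee that no per-byte sum overflows, so no carry ever crosses a byte boundary, and without that guarantee you'd need the usual SWAR masking tricks):

#include <cstdint>
#include <cstring>

void add16_swar (uint8_t *as, uint8_t *bs) {
    for (int chunk = 0; chunk < 16; chunk += 8) {
        uint64_t a, b;
        std::memcpy(&a, as + chunk, 8);   // byte-wise copies are alignment- and aliasing-safe
        std::memcpy(&b, bs + chunk, 8);
        a += b;                           // 8 byte sums at once; no cross-byte carry by assumption
        std::memcpy(as + chunk, &a, 8);   // store the result back into a
    }
}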

RISC-V extension V exists, but not all cores have it, so SWAR hacks are probably once again useful. And its vectors are variable-length, depending on the hardware width, so you have to set up a mask to use it for short fixed-length stuff. Not quite as cheap as on AArch64, where you can just use ASIMD 128-bit vectors when they're perfect for the job, even if SVE is available. But hopefully most of the setup cost can be hoisted outside an outer loop.

Upvotes: 2

wepajakeg

Reputation: 53

Cast the arrays of bytes to uint64_t and perform the 16 sums with 2 64-bit add instructions.

On x86, if you compile with -O3, it gets compiled down to a single 128-bit XMM addition (Compiler Explorer).

It compiles to

movdqa xmm0, XMMWORD PTR [rsp]   ; Load 128 bits from [rsp] (array a) into xmm0
paddq xmm0, XMMWORD PTR [rsp+16] ; Add 128 bits from [rsp+16] (array b) to xmm0
movaps XMMWORD PTR [rsp], xmm0   ; Store the result back into [rsp] (array a)

Here is the code

#include <iostream>
#include <cstdint>
#include <cstdlib>
#include <ctime>
#include <iomanip>

// Define a union that can hold either an array of 8 bytes or a 64-bit unsigned integer
union ByteArray64
{
    uint8_t bytes[8];
    uint64_t uint64;
};

inline void add16_optimized(uint8_t *as, uint8_t *bs)
{
    // Cast the input arrays to arrays of the union type
    union ByteArray64 *a64 = (union ByteArray64 *)as;
    union ByteArray64 *b64 = (union ByteArray64 *)bs;

    // Perform the addition using 64-bit arithmetic
    a64[0].uint64 += b64[0].uint64;
    a64[1].uint64 += b64[1].uint64;
}

void printArray(uint8_t *array, size_t length)
{
    for (size_t i = 0; i < length; i++)
        std::cout << std::setw(3) << static_cast<unsigned>(array[i]) << " ";
    std::cout << std::endl;
}

void generateRandomArray(uint8_t *array, size_t length, int maxValue)
{
    for (size_t i = 0; i < length; i++)
        array[i] = rand() % (maxValue + 1);
}

int main()
{
    const size_t arrayLength = 16;
    uint8_t a[arrayLength];
    uint8_t b[arrayLength];

    // Seed the random number generator
    std::srand(std::time(0));

    // Generate random arrays with values between 0 and 125
    generateRandomArray(a, arrayLength, 125);
    generateRandomArray(b, arrayLength, 125);

    std::cout << "Array a:" << std::endl;
    printArray(a, arrayLength);

    std::cout << "Array b:" << std::endl;
    printArray(b, arrayLength);

    add16_optimized(a, b);

    // Print the result to verify
    std::cout << "Resulting array after addition:" << std::endl;
    printArray(a, arrayLength);

    return 0;
}
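
One caveat: in ISO C++, reading a union member other than the one most recently written is technically undefined behaviour (GCC and Clang document it as working), and the pointer cast also assumes the byte arrays are suitably aligned for 64-bit access. If that matters for your toolchain, the same two 64-bit additions can be expressed with memcpy instead; a sketch (the function name is made up for the example):

#include <cstdint>
#include <cstring>

inline void add16_optimized_memcpy (uint8_t *as, uint8_t *bs)
{
    uint64_t a0, a1, b0, b1;
    std::memcpy(&a0, as, 8);        // alignment- and aliasing-safe loads
    std::memcpy(&a1, as + 8, 8);
    std::memcpy(&b0, bs, 8);
    std::memcpy(&b1, bs + 8, 8);
    a0 += b0;                       // per the question, no byte sum overflows, so no carry crosses bytes
    a1 += b1;
    std::memcpy(as, &a0, 8);        // store the results back into a
    std::memcpy(as + 8, &a1, 8);
}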

Upvotes: 0

KevinZ

Reputation: 3319

Your v2 is better than v1 because the writes didn't clobber the reads, but it just looks ugly. Your v3 is worse because the 128-bit addition has to propagate a carry between its two 64-bit halves.

There are ways to improve upon v2. First, use architecture flags: even if you only pass something as old as -march=sandybridge (from 13 years ago), it will use a single AVX load. Second, choose your poison: do you want to coax the compiler with various promises and hope it generates the right code, or do you want to use intrinsics? If you are fine with intrinsics, then here it is: https://godbolt.org/z/W8oWE65vW

#include <cstdint>
#include <immintrin.h>

void add16_v4 (uint8_t *as, uint8_t *bs)
{
    auto const a = _mm_loadu_si128((__m128i const *) as); // unaligned 128-bit load of as
    auto const b = _mm_loadu_si128((__m128i const *) bs); // unaligned 128-bit load of bs
    _mm_storeu_si128((__m128i *) as, _mm_add_epi8(a, b)); // 16 byte-wise adds, stored back into as
}

Upvotes: 1

Jan Schultke

Reputation: 39774

Your naive function didn't compile to a parallelized addition because there is potential aliasing between as and bs. After every as[i] += bs[i], it's possible that the modification of as[i] also modified some bs[j] if the pointers alias each other.

You can get better codegen by using __restrict (see also What does the restrict keyword mean in C++?):

void add16_reference (uint8_t * __restrict as, uint8_t * __restrict bs) {
   for (auto i = 0; i < 16; i++) {
     as[i] += bs[i];
   }
}

GCC, Clang, and MSVC all use the same keyword, so __restrict is surprisingly portable. All these compilers use a paddb instruction for this C++ code, which performs all 16 additions at once, and that should be the fastest method.

Another option is to use Intel SIMD intrinsics; the whole function can be replaced with _mm_add_epi8.

The other versions that you've written give you more or less optimal codegen as well, but this varies a lot from compiler to compiler, and they are not particularly readable.

Note on micro-benchmarking

This is important because in the benchmarks using plain old movs/adds is ~8 times faster https://quick-bench.com/q/z0374AXew8_eL8eDoQXn9XlJm9g

Your benchmark is wrong. You're simply defining the input to each function as a local array known at compile time. For v3, the compiler optimizes your "benchmark" to:

movabs rax,0x807060504030201
movabs rdx,0x100f0e0d0c0b0a09
mov    rsi,rax
mov    rdi,rdx
add    rax,rsi

For whatever reason, it does not find this optimization for v1 and v2 and instead loads the input data from XMMWORD PTR [rip+0x1b4a2]. Most of the time also seems to be spent in unnecessary movs emitted as the result of benchmark::DoNotOptimize, not on the addition itself.

This demonstrates the importance of writing clean, idiomatic code that a compiler can understand. All the memcpy nonsense is obfuscating your code and making it harder for GCC to optimize.
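
For illustration, here is one way such a benchmark could be structured so the inputs can't be constant-folded. This is a sketch against the google/benchmark API that quick-bench runs (the function name BM_add16 and the fill values are made up for the example): DoNotOptimize makes the compiler treat the buffers' contents as unknown, and ClobberMemory forces the stores to actually happen each iteration.

#include <benchmark/benchmark.h>
#include <cstdint>
#include <cstring>

static void add16_reference (uint8_t *__restrict as, uint8_t *__restrict bs) {
    for (int i = 0; i < 16; i++)
        as[i] += bs[i];
}

static void BM_add16 (benchmark::State &state) {
    uint8_t a[16], b[16];
    std::memset(a, 1, sizeof a);
    std::memset(b, 2, sizeof b);
    uint8_t *ap = a, *bp = b;
    for (auto _ : state) {
        benchmark::DoNotOptimize(ap);   // the pointer escapes; the contents of a are no longer "known"
        benchmark::DoNotOptimize(bp);
        add16_reference(ap, bp);
        benchmark::ClobberMemory();     // the stores into a can't be optimized away
    }
}
BENCHMARK(BM_add16);
// (add BENCHMARK_MAIN(); when building outside quick-bench)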

Upvotes: 5
