Devvy

Reputation: 45

How to perform parallel addition using AVX with carry (overflow) fed back into the same element (PE checksum)?

I want to perform eight parallel additions of 16-bit values using AVX SIMD. The carry (overflow) of each element's addition must be handled, i.e. an 'add with carry' like the old x86 'adc' instruction performs, but per element.

I implemented half of the AVX solution myself; the carry handling, which should also be done with AVX instructions, is still missing. My current solution:

#include <stdio.h>
#include <stdint.h>
#include <immintrin.h>

typedef union _uint16vec uint16vec, *uint16vec_ptr;

union  __attribute__((aligned(16))) _uint16vec
{
  __m128i             x;
  uint16_t            y[8];
};

__m128i parallel_add_with_carry ( __m128i n1, __m128i n2 )
{
  volatile uint16vec  res, stupid_carry;
  uint32_t            i;

  stupid_carry.x = n1; /* load 8x uint16_t for carry adjustment below */

  __asm__
  (
    "movdqa %1, %%xmm0             \n\t"
    "movdqa %2, %%xmm1             \n\t"
    "vpaddw %%xmm0, %%xmm1, %%xmm0 \n\t"
    "movdqa %%xmm0, %0             \n\t"
    : "=m" (res.x)                      /* output */
    : "m" (n1), "m" (n2)                /* inputs */
    : "xmm0", "xmm1"                    /* GCC, please clobber XMM0 and XMM1 */
  );

  /* if an element of the result is less than the corresponding
   * element of the original operand (n1), that element overflowed...
   */
  for (i=0;i<8;i++)
    res.y[i] += (res.y[i] < stupid_carry.y[i]) ? 1 : 0;

  return res.x;
}

void test ( void )
{
  uint16vec   v1 = {0}, v2 = {0}, res;

  v1.y[0] = 0x000A; v2.y[0] = 0x0014; /* 10+20 = 30 (0x001E), no overflow */
  v1.y[1] = 0xFFF0; v2.y[1] = 0x0013; /* 0xFFF0 + 0x0013 = 0x0003 -> overflow -> 0x0004 */

  res.x = parallel_add_with_carry(v1.x, v2.x);

  fprintf(stdout,"%04X | %04X\n", res.y[0], res.y[1]);
}

The object code GCC emits for the function's epilogue is terrible (even with -O3). My question: is there a better AVX-supported solution to the 'add with carry' problem?

My idea was to provide a 128-bit temporary vector { 0x0001, 0x0001, ..., 0x0001 } (the carry vector) and add it to the eight uint16_t's if and only if the preceding compare operation resulted in 'less than' for the specific uint16_t in the 128-bit vector.

I browsed the Intel documentation and found nice add instructions that just copy source parts of the vector if a condition is met.

Any help with this 'AVX thing' is highly appreciated. Thanks.

Upvotes: 0

Views: 302

Answers (2)

Andrey Semashev

Reputation: 10614

One simple idea is to leverage unsigned saturation, which makes the results of the addition different from the overflowing addition when overflow happens.

__m128i parallel_add_with_carry(__m128i n1, __m128i n2)
{
    // Make an all-ones vector. This is virtually free and does not
    // consume an execution unit.
    const __m128i mm_all_ones = _mm_cmpeq_epi32(n1, n1);

    __m128i mm_sum = _mm_add_epi16(n1, n2);
    // Since there is no "not equal" comparison that produces
    // a vector mask, we compare for "equal" and get a negated
    // carry mask.
    __m128i mm_carry = _mm_cmpeq_epi16(_mm_adds_epu16(n1, n2), mm_sum);
    // Negate the mask, which turns elements that were 0 into -1 and vice versa
    mm_carry = _mm_xor_si128(mm_carry, mm_all_ones);
    // Add carry (i.e. subtract -1 or 0)
    mm_sum = _mm_sub_epi16(mm_sum, mm_carry);

    return mm_sum;
}

Godbolt

This implementation is compatible with SSE2 and later. Using AVX-512 and opmask registers is possible, but unlikely to be any faster (see the asm generated by gcc 14.2 with -O3 -march=rocketlake on the Godbolt link above).
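
For illustration, such an AVX-512 variant might look like the following. This is only a sketch (it assumes AVX-512BW and AVX-512VL are available; the function name is made up for this example):

#include <immintrin.h>

// Sketch of the AVX-512 variant mentioned above (requires AVX-512BW + VL).
// A carry occurred in an element iff (n1 + n2) < n1 in the unsigned sense;
// the unsigned compare writes an opmask, which then drives a masked add.
__m128i parallel_add_with_carry_avx512(__m128i n1, __m128i n2)
{
    __m128i mm_sum = _mm_add_epi16(n1, n2);
    __mmask8 k_carry = _mm_cmplt_epu16_mask(mm_sum, n1);
    // Add 1 only to the elements whose mask bit is set;
    // the other elements are merged through unchanged.
    return _mm_mask_add_epi16(mm_sum, k_carry, mm_sum, _mm_set1_epi16(1));
}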

If you're processing moderate amounts of data that likely fit in cache, it may be beneficial to further optimize this code by unrolling the loop and using separate accumulators for the sum and carry, and only add them together at the end, when all data is processed. This improves instruction-level parallelism (ILP) as the separate accumulations form separate dependency chains and can happen in parallel, provided that there are enough arithmetic execution units in the CPU and the algorithm isn't bottlenecked elsewhere (e.g. by memory bandwidth).

__m128i parallel_add_with_carry_loop(const unsigned char* data, size_t size)
{
    __m128i mm_sum1 = _mm_setzero_si128(), mm_sum2 = _mm_setzero_si128();
    __m128i mm_carry1 = _mm_setzero_si128(), mm_carry2 = _mm_setzero_si128();
    const __m128i mm_all_ones = _mm_cmpeq_epi32(mm_sum1, mm_sum1);

    for (size_t i = 0u, n = size & (size_t)(-32); i < n; i += 32)
    {
        __m128i mm1 = _mm_loadu_si128((const __m128i*)(data + i));
        __m128i mm2 = _mm_loadu_si128((const __m128i*)(data + i + 16));
        __m128i mm_new_sum1 = _mm_add_epi16(mm_sum1, mm1);
        __m128i mm_new_sum2 = _mm_add_epi16(mm_sum2, mm2);
        __m128i mm_neg_carry1 = _mm_cmpeq_epi16(
            _mm_adds_epu16(mm_sum1, mm1), mm_new_sum1);
        __m128i mm_neg_carry2 = _mm_cmpeq_epi16(
            _mm_adds_epu16(mm_sum2, mm2), mm_new_sum2);
        mm_sum1 = mm_new_sum1;
        mm_sum2 = mm_new_sum2;
        mm_carry1 = _mm_sub_epi16(mm_carry1,
            _mm_xor_si128(mm_neg_carry1, mm_all_ones));
        mm_carry2 = _mm_sub_epi16(mm_carry2,
            _mm_xor_si128(mm_neg_carry2, mm_all_ones));
    }

    // If size is not a multiple of 32, process the tail bytes here
    // ...
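
    // One possible way to handle the tail (a sketch): copy the remaining
    // bytes into a zero-filled 32-byte buffer and run one more 16-byte step
    // per half. Padding with zeros is safe because adding zero never
    // produces a carry.
    if ((size & 31u) != 0u)
    {
        unsigned char buf[32] = {0};
        for (size_t i = size & (size_t)(-32), j = 0u; i < size; ++i, ++j)
            buf[j] = data[i];

        for (size_t j = 0u; j < 32u; j += 16u)
        {
            __m128i mm = _mm_loadu_si128((const __m128i*)(buf + j));
            __m128i mm_new_sum = _mm_add_epi16(mm_sum1, mm);
            __m128i mm_neg_carry = _mm_cmpeq_epi16(
                _mm_adds_epu16(mm_sum1, mm), mm_new_sum);
            mm_sum1 = mm_new_sum;
            mm_carry1 = _mm_sub_epi16(mm_carry1,
                _mm_xor_si128(mm_neg_carry, mm_all_ones));
        }
    }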

    // Combine the accumulated sum and carry. Note that adding sums
    // may overflow, and we need to account for the carry from this addition.
    mm_sum1 = parallel_add_with_carry(mm_sum1, mm_sum2);
    mm_carry1 = _mm_add_epi16(mm_carry1, mm_carry2);
    mm_sum1 = _mm_add_epi16(mm_sum1, mm_carry1);

    return mm_sum1;
}

Upvotes: 3

Devvy

Reputation: 45

I found a solution using AVX-512 instructions. The key is the merging-masking functionality that uses the CPU's opmask registers (named K0 through K7).

I have tested it on a server box with an AMD Epyc 7000 series CPU (48 cores).

C source code with inline assembly

In the context of my initial source code:

    __m128i parallel_add_with_carry_simd ( __m128i n1, __m128i n2 )
    {
      static const uint16vec carry = { .y[0] = 0x0001,
                                       .y[1] = 0x0001,
                                       .y[2] = 0x0001,
                                       .y[3] = 0x0001,
                                       .y[4] = 0x0001,
                                       .y[5] = 0x0001,
                                       .y[6] = 0x0001,
                                       .y[7] = 0x0001 };
      uint16vec res;
    
      __asm__
      (
        "movdqa    %1    , %%xmm0                 \n\t"
        "movdqa    %2    , %%xmm1                 \n\t"
        "movdqa    %3    , %%xmm3                 \n\t"
        "movdqa    %%xmm0, %%xmm2                 \n\t"
        "vpaddw    %%xmm1, %%xmm0, %%xmm0         \n\t"
        "vpcmpltuw %%xmm2, %%xmm0, %%k1           \n\t"
        "vpaddw    %%xmm3, %%xmm0, %%xmm0 %{%%k1%}\n\t"
        "movdqa    %%xmm0, %0                     \n\t"
        : "=m" (res.x)
        : "m" (n1), "m" (n2), "m" (carry.x)
        : "xmm0", "xmm1", "xmm2", "xmm3"
      );
    
      return res.x;
    }

Essentially, you have to execute the sequence:

  • First addition of the eight 16-bit unsigned integers in a 128-bit XMM register;
  • Execution of 'vpcmpuw' with condition code 'LT' (less than);
  • Second, CONDITIONAL addition of a carry vector (1,1,1,1,1,1,1,1).

This implements the old x86 'add with carry' instruction 'adc' on vectors of 16-bit unsigned integers.

Walkthrough and technical background

The inline assembly loads the two vectors to be added with carry, n1 and n2, into XMM0 and XMM1. The helper carry vector (1,1,1,1,1,1,1,1) is loaded into XMM3.

We save the original vector n1 in XMM2 (for later comparison). The calculation of n1 + n2 is performed (no saturation!) according to:

    XMM0 := XMM0 (n1) + XMM1 (n2)

This intermediate result does not include the carries yet.

The role of vpcmpuw

The key is the comparison

    vpcmpltuw %%xmm2, %%xmm0, %%k1

meaning: compare the intermediate result in XMM0 with the original input n1 in XMM2, using the condition code 'less than' (LT), and store the eight comparison results as separate bits in the mask register K1.

The second, conditional add operation

The second vpaddw instruction performs the 'carry correction' on the eight 16-bit integers in XMM0:

     vpaddw %%xmm3, %%xmm0, %%xmm0 %{%%k1%}

This (admittedly odd-looking) notation means: add the carry vector (in XMM3) to the intermediate result (in XMM0), storing the overall result again in XMM0, and use merging-masking (as opposed to zeroing-masking, see below) while adding. The pseudo code of merging-masking is:

    FORALL eight 16bit integers in XMM0 DO
    IF   : bit in K1 is 1 
    THEN : add 0x0001 from the carry vector in XMM3 (XMM0 := XMM0 + XMM3)
    ELSE : just copy the current 16bit integer from src (XMM0) to dst (XMM0)

What is zeroing-masking?

The alternative to merging-masking is zeroing-masking. In that case you would have to write the inline assembly as follows:

    vpaddw %%xmm3, %%xmm0, %%xmm0 %{%%k1%}%{z%}

If a bit in the K1 register is one, the addition is performed. If it is zero, 0x0000 is stored in the destination (thus zeroing the corresponding 16-bit integer).
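
The same merging/zeroing distinction exists at the intrinsics level. Here is a small illustrative sketch (the function name, mask value and element values are made up for this example; it requires AVX-512BW and AVX-512VL):

    #include <immintrin.h>
    #include <stdio.h>

    /* Sketch contrasting merging-masking and zeroing-masking
     * on the carry-correction step.
     */
    void masking_demo ( void )
    {
      const __m128i carry = _mm_set1_epi16(1);
      __m128i sum = _mm_setr_epi16(0x0003, 0x001E, 0, 0, 0, 0, 0, 0);
      __mmask8 k1 = 0x01; /* pretend only element 0 overflowed */

      /* merging-masking: element 0 becomes 0x0004, element 1 keeps 0x001E */
      __m128i merged = _mm_mask_add_epi16(sum, k1, sum, carry);

      /* zeroing-masking: element 0 becomes 0x0004, element 1 becomes 0x0000 */
      __m128i zeroed = _mm_maskz_add_epi16(k1, sum, carry);

      fprintf(stdout, "%04X %04X | %04X %04X\n",
              _mm_extract_epi16(merged, 0), _mm_extract_epi16(merged, 1),
              _mm_extract_epi16(zeroed, 0), _mm_extract_epi16(zeroed, 1));
    }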

Credits

I was eventually able to solve it thanks to this post: stackoverflow, writemask in AVX-512

Upvotes: 0
