fantasie
fantasie

Reputation: 307

How to perform addition of two vectors of 8-bit integers with a single addition in C/C++

From my reading of this answer, it is possible to perform addition of a pair of two 4-bit integers with just one addition and some bitwise operations, and the author of that question states it should be easy to generalize that method a bit. Thus, I try to generalize this method to the addition of 4 dimensional vectors of 8-bit integers, stored in a 32-bit integer.

uint32_t a, b;

uint32_t c = a + b;
uint32_t r = a ^ b ^ c; // calculate all carry
uint32_t s = r & (0x01010100); // carry of only the digits that we care
uint32_t sum = c - s; // undo the carry on these digits

This code works for most input, but I found exceptions when I try to add integers that are basically complement form of negative numbers close to 0. For example, if I set

a = 0xfffeff00 // (-1,-2,-1, 0)
b = 0x01010100 // ( 1, 1, 1, 0)

expected result is 0x00ff0000. It runs

c = 0x01000000 // ( 1, 0, 0, 0)
r = 0xfffffe00 // (-1,-1,-2, 0)
s = 0x01010000 // ( 1, 1, 0, 0)
sum=0xffff0000 // (-1,-1, 0, 0)

The first two digits of sum becomes ff (which represents -1), but the correct number should be 0. This is because when subtracting c = 01 00 00 00 with s = 01 01 00 00, the subtraction of the second group overflows and affects the result on the first group.

I want to improve the last step to get rid of this problem, but can't figure a good solution.

Upvotes: 5

Views: 316

Answers (4)

nielsen
nielsen

Reputation: 7759

The following is a solution:

    uint32_t c = (a & 0x7f7f7f7f) + (b & 0x7f7f7f7f);
    uint32_t d = (a & 0x80808080) + (b & 0x80808080) + (c & 0x80808080);
    uint32_t sum = (c & 0x7f7f7f7f) | (d & 0x80808080);

The idea is to first add the lower 7 bits of each octet which gives the correct lower 7 bits of each octet of the result without spilling carries into the next octet. Then add the upper bits of each octet along with the carries from the first addition to get the correct upper bit of each octet of the result. Finally, the result is "glued" together from the correct bits in each part.

Update: As correctly commented, this can be simplified by replacing the second line with:

    uint32_t d = a ^ b ^ c;

Upvotes: 4

Ben Voigt
Ben Voigt

Reputation: 283753

If you are able to solve this problem by using "two additions with no dependency", please write an answer.

No problem, we just mask to select non-adjacent inputs and prevent carry in from other elements:

using quad8_t = uint32_t;
quad8_t a, b;  // inputs

const quad_t oddmask = 0x00ff00ffu, evenmask = 0xff00ff00u;
quad8_t oddsums = ((a & oddmask) + (b & oddmask)) & oddmask;
quad8_t evensums = ((a & evenmask) + (b & evenmask)) & evenmask;
quad8_t sum = oddsums | evensums;

Easily generalizes to any number of packed words in the input.

The additions are each dependent on a and b but neither is dependent on the other addition.

Upvotes: 1

John Bollinger
John Bollinger

Reputation: 181104

You write that the problem arises

because when subtracting c = 01 00 00 00 with s = 01 01 00 00, the subtraction of the second group overflows and affects the result on the first group.

That's partially correct, but note that the original addition also overflowed across groups, and you do need to correct for that. In fact, the problem arises only in the more narrow case that during the addition, overflow from one group to the next causes the next group to overflow as well, where it otherwise wouldn't have done so. Then the overflow is properly corrected by subtracting the first carry only (and this will indeed overflow). Other carries in the chain should not also be corrected because the one correction does the whole job.

Thus, the criteria for a carry into byte n to be one that requires correction are:

  • in the raw sum, byte n - 1 is nonzero (so any carry from this byte is independent of any carry into it)

    OR

  • there was no carry into byte n - 1

That can be implemented in byte vector form like so:

uint32_t a = 0xfffeff00;
uint32_t b = 0x01010100;

uint32_t raw_sum = a + b;

// bits that received a carry
uint32_t raw_carry = a ^ b ^ raw_sum;

// bits that received a cross-byte carry
uint32_t xb_carry = raw_carry  & 0x01010100u;

// bytes of the raw sum that are nonzero (the lsb of each byte is the flag)
uint32_t bytes_nz = raw_sum | (raw_sum >> 4);
bytes_nz |= (bytes_nz >> 2);
bytes_nz |= (bytes_nz >> 1);

// the carries that need to be corrected
uint32_t unique_carry = xb_carry & ((bytes_nz | ~xb_carry) << 8);

// final result
uint32_t corrected_sum = raw_sum - unique_carry;

If you start with the assumption that you're going to proceed by evaluating a + b and then applying a correction for carries, as the question seems to indicate, then that does the job correctly and fairly efficiently.

That's a lot of operations, however -- quite possibly more than enough to offset any benefit from the manual vectorization. Compare this:

union bytevec { uint32_t as_uint; unsigned char as_chars[4]; };
union bytevec vec_a = { .as_uint = a };
union bytevec vec_b = { .as_uint = b };

union bytevec vec_result = { .as_chars = {
    vec_a.as_chars[0] + vec_b.as_chars[0],
    vec_a.as_chars[1] + vec_b.as_chars[1],
    vec_a.as_chars[2] + vec_b.as_chars[2],
    vec_a.as_chars[3] + vec_b.as_chars[3]
} };

return vec_result.as_uint;

It's certainly clearer, and it has fewer explicit operations, though what the compiler actually does with it might or might not be an improvement over the byte-vector approach.


Note: for either of those approaches to work with bytes (ultimately, elsewhere) interpreted as signed char, the C implementation must use two's complement representation for type signed char. You're unlikely to meet any other kind of implementation, but prior to C23, C does allow it.

Upvotes: 1

Alon Alush
Alon Alush

Reputation: 1075

You can do something like this:

uint32_t a, b;

uint32_t c = a + b;

uint32_t r = a ^ b ^ c;
uint32_t s = r & 0x01010100;
uint32_t sum = 0;
sum  = ((c      & 0xFF) - ( s       & 0xFF))       & 0xFF;
sum |= ((((c>>8) & 0xFF) - ((s>>8 ) & 0xFF)) & 0xFF) << 8;
sum |= ((((c>>16)& 0xFF) - ((s>>16) & 0xFF)) & 0xFF) << 16;
sum |= ((((c>>24)& 0xFF) - ((s>>24) & 0xFF)) & 0xFF) << 24;

Upvotes: 0

Related Questions