Reputation: 123
I was recently given the following problem:
There are 8 elements in a vector, each represented as an int8_t.
Implement an algorithm in x86_64 assembly that adds two such vectors (each packed into a uint64_t).
The elements must be added with saturation arithmetic.
E.g.:
80 + 60 = 127
(−40) + (−100) = −128
The biggest challenge turns out to be the restrictions imposed: I can use only "standard" data-transfer, arithmetic, and logical instructions, and the standard registers.
I can't think of any solution that fits these restrictions. Could anyone give me some hints? Examples in C are welcome.
Upvotes: 0
Views: 524
Reputation: 5040
Here is a version (tested, and it does not require imul) that compiles to 22 instructions with clang-16.
#include <stdint.h>

uint64_t add(uint64_t x, uint64_t y) {
    uint64_t eq, xv, yv, satmask, satbits, satadd, t0, t1;
    uint64_t signmask = 0x8080808080808080ULL;
    eq = (x ^ ~y) & signmask;          /* sign-bit set in bytes where x and y have the same sign */
    xv = x & ~signmask;
    yv = y & ~signmask;
    xv += yv;                          /* per-byte add of the low 7 bits; no carry leaves a byte lane */
    satbits = (xv ^ y) & eq;           /* sign-bit set in bytes whose signed addition overflows */
    satadd = satbits >> 7;
    satmask = (satbits << 1) - satadd; /* expand overflow flags into full-byte masks */
    xv ^= eq;
    t0 = (xv & ~satmask) ^ signmask;
    t1 = satadd & ~(xv >> 7);
    return t0 - t1;                    /* 0x80 or 0x7f in saturating bytes, plain byte sum elsewhere */
}
Assembly:
mov rdx, rsi
xor rdx, rdi
not rdx
movabs r8, -9187201950435737472
and rdx, r8
movabs rcx, 9187201950435737471
and rdi, rcx
and rcx, rsi
add rcx, rdi
xor rsi, rcx
and rsi, rdx
lea rax, [rsi + rsi]
shr rsi, 7
xor rcx, rdx
not rax
add rax, rsi
and rax, rcx
xor rax, r8
shr rcx, 7
not rcx
and rcx, rsi
sub rax, rcx
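For a quick sanity check of the C version against the two examples from the question, something like the following can be used (my addition; it assumes the add function above is in scope):
#include <stdio.h>
#include <stdint.h>

int main(void) {
    /* byte 0 holds 80 and 60, byte 1 holds -40 and -100, all other lanes are zero */
    uint64_t x = (uint64_t)(uint8_t)80 | ((uint64_t)(uint8_t)(int8_t)-40 << 8);
    uint64_t y = (uint64_t)(uint8_t)60 | ((uint64_t)(uint8_t)(int8_t)-100 << 8);
    uint64_t r = add(x, y);
    printf("byte 0: %d (expected 127)\n", (int8_t)(uint8_t)(r & 0xff));
    printf("byte 1: %d (expected -128)\n", (int8_t)(uint8_t)(r >> 8));
    return 0;
}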
Upvotes: 5
Reputation: 26175
The following code uses a pedestrian approach to byte-wise addition with signed saturation, but is very competitive in terms of instruction count and execution time with Falk Hüffner's excellent algorithm.
To avoid crossing byte-lane boundaries, the classical approach for emulated SIMD arithmetic is to perform the computation separately for the low-order seven bits and the most significant bit of each byte, then merge the partial results. In this case this also helps with detecting signed integer overflow, one definition of which is that the carry-in to the most significant bit differs from the carry-out from that bit.
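As a rough sketch of this first step (my own illustration in C, not part of the original answer; x and y stand for the two packed operands), the low seven bits of every byte are added with the sign bits masked off, so no carry can leave its byte lane, and the sign bits are then folded back in modulo 2:
uint64_t NMSB_MASK = 0x7f7f7f7f7f7f7f7fULL;       /* low seven bits of every byte */
uint64_t low = (x & NMSB_MASK) + (y & NMSB_MASK);  /* carries stay inside their byte lanes */
uint64_t sum = low ^ ((x ^ y) & ~NMSB_MASK);       /* wrapping byte-wise sum: sign bits added mod 2 */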
Signed integer overflow in addition can only occur when the signs of the addends are the same. If overflow occurs, the byte-size special result (spc in the code below) is either 0x7f or 0x80, and this can therefore be computed from the sign of either addend. The overflow flag is expanded into a full-byte mask of all-zeros or all-ones, and this is used to select either the regular addition result (res in the code below) or the special overflow result in a traditional multiplexing idiom.
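For illustration (my addition, reusing the names from the assembly comments below, where ofl holds only the bit-7 overflow flag of each byte), the mask expansion and the multiplexing step look like this in C:
uint64_t mask = (ofl << 1) - (ofl >> 7);         /* 0xff in bytes that overflowed, 0x00 elsewhere */
uint64_t result = (spc & mask) | (res & ~mask);  /* select saturated value or regular sum per byte */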
The question lists various instructions from the BMI2 instruction set extension (introduced in 2013) as permissible, so I will assume that use of the andn instruction from the BMI1 extension is likewise allowed, although it is not explicitly listed in the question.
I developed my implementation epaddsb on a Windows 10 machine, and the code therefore uses the Windows calling convention for x86-64. Changing this for the System V ABI used by Linux is trivial: simply exchange a few register names. For a comparison with Falk Hüffner's algorithm I compiled his C code with a recent Intel oneAPI compiler and captured the generated code in hpaddsb.
epaddsb requires 21 instructions not counting the ret, while hpaddsb requires 20. The performance of the two variants is identical within the measurement noise level of ±2% on my PC, which is based on a Skylake CPU.
PUBLIC epaddsb
_TEXT SEGMENT
ALIGN 16
;; epaddsb(a,b): emulated byte-wise 64-bit addition with signed saturation
;;
;; Windows x86-64 calling convention:
;; function arguments: rcx, rdx, {r8, r9}
;; function return value: rax
;; scratch registers: rax, rcx, rdx, r8, r9, {r10, r11}
epaddsb PROC
mov rax, 7f7f7f7f7f7f7f7fh ; NMSB_MASK = ~MSB_MASK
mov r8, rcx ; a
mov r9, rdx ; b
and rcx, rax ; a & NMSB_MASK
and rdx, rax ; b & NMSB_MASK
xor r9, r8 ; sum = a ^ b
add rdx, rcx ; res = (a & NMSB_MASK) + (b & NMSB_MASK)
andn rcx, rax, r8 ; a & ~NMSB_MASK
xor r8, rdx ; res ^ a
shr rcx, 7 ; (a & ~NMSB_MASK) >> 7
andn r8, r9, r8 ; ofl = (res ^ a) & ~sum
add rcx, rax ; spc = ((a & ~NMSB_MASK) >> 7) + NMSB_MASK
andn r9, rax, r9 ; sum & ~NMSB_MASK
xor rdx, r9 ; res = res ^ (sum & ~NMSB_MASK)
andn r8, rax, r8 ; ofl & ~NMSB_MASK
lea r9, [r8 + r8] ; ofl << 1
shr r8, 7 ; ofl >> 7
sub r9, r8 ; mask = (ofl << 1) - (ofl >> 7)
andn rax, r9, rdx ; res & ~mask
and rcx, r9 ; spc & mask
or rax, rcx ; res = (spc & mask) | (res & ~mask)
ret
epaddsb ENDP
ALIGN 16
;; Falk Hüffner's algorithm from https://stackoverflow.com/a/76090715/780717
;; Compiled by Intel(R) oneAPI DPC++/C++ compiler version 2023.0.0
hpaddsb PROC
mov rax, rdx ;
xor rax, rcx ;
mov r8, 8080808080808080h ;
andn r9, rax, r8 ;
mov r10, 7f7f7f7f7f7f7f7fh;
and rcx, r10 ;
and r10, rdx ;
add r10, rcx ;
xor rdx, r10 ;
and rdx, r9 ;
lea rax, [rdx + rdx] ;
shr rdx, 7 ;
xor r10, r9 ;
not rax ;
add rax, rdx ;
and rax, r10 ;
xor rax, r8 ;
shr r10, 7 ;
andn rcx, r10, rdx ;
sub rax, rcx ;
ret
hpaddsb ENDP
ALIGN 16
_TEXT ENDS
END
I am showing my test scaffolding below. I built as follows
ml64 /c /Fo paddsb.obj paddsb.asm
icx /W4 /Ox /QxHOST paddsb_stackoverflow.c paddsb.obj
using Microsoft Macro Assembler 14.27.29112.0 and Intel oneAPI DPC++/C++ Compiler 2023.0.0.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#define NBR_TEST_CASES (1000000000)
#define TEST_HUEFFNER_ALGO (0)
/* emulated byte-wise 64-bit addition with signed saturation; in assembly */
extern uint64_t epaddsb (uint64_t a, uint64_t b); /* algorithm: N. Juffa */
extern uint64_t hpaddsb (uint64_t a, uint64_t b); /* algorithm: F. Hüffner */
/* reference function for byte-wise addition with signed saturation */
uint64_t paddsb_ref (uint64_t a, uint64_t b)
{
    int8_t a0 = (int8_t)(uint8_t)(a >> 0);
    int8_t a1 = (int8_t)(uint8_t)(a >> 8);
    int8_t a2 = (int8_t)(uint8_t)(a >> 16);
    int8_t a3 = (int8_t)(uint8_t)(a >> 24);
    int8_t a4 = (int8_t)(uint8_t)(a >> 32);
    int8_t a5 = (int8_t)(uint8_t)(a >> 40);
    int8_t a6 = (int8_t)(uint8_t)(a >> 48);
    int8_t a7 = (int8_t)(uint8_t)(a >> 56);
    int8_t b0 = (int8_t)(uint8_t)(b >> 0);
    int8_t b1 = (int8_t)(uint8_t)(b >> 8);
    int8_t b2 = (int8_t)(uint8_t)(b >> 16);
    int8_t b3 = (int8_t)(uint8_t)(b >> 24);
    int8_t b4 = (int8_t)(uint8_t)(b >> 32);
    int8_t b5 = (int8_t)(uint8_t)(b >> 40);
    int8_t b6 = (int8_t)(uint8_t)(b >> 48);
    int8_t b7 = (int8_t)(uint8_t)(b >> 56);
    b0 = ((a0 + b0) > 127) ? 127 : (((a0 + b0) < (-128)) ? (-128) : (a0 + b0));
    b1 = ((a1 + b1) > 127) ? 127 : (((a1 + b1) < (-128)) ? (-128) : (a1 + b1));
    b2 = ((a2 + b2) > 127) ? 127 : (((a2 + b2) < (-128)) ? (-128) : (a2 + b2));
    b3 = ((a3 + b3) > 127) ? 127 : (((a3 + b3) < (-128)) ? (-128) : (a3 + b3));
    b4 = ((a4 + b4) > 127) ? 127 : (((a4 + b4) < (-128)) ? (-128) : (a4 + b4));
    b5 = ((a5 + b5) > 127) ? 127 : (((a5 + b5) < (-128)) ? (-128) : (a5 + b5));
    b6 = ((a6 + b6) > 127) ? 127 : (((a6 + b6) < (-128)) ? (-128) : (a6 + b6));
    b7 = ((a7 + b7) > 127) ? 127 : (((a7 + b7) < (-128)) ? (-128) : (a7 + b7));
    return (((uint64_t)(uint8_t)b0 << 0) | ((uint64_t)(uint8_t)b1 << 8) |
            ((uint64_t)(uint8_t)b2 << 16) | ((uint64_t)(uint8_t)b3 << 24) |
            ((uint64_t)(uint8_t)b4 << 32) | ((uint64_t)(uint8_t)b5 << 40) |
            ((uint64_t)(uint8_t)b6 << 48) | ((uint64_t)(uint8_t)b7 << 56));
}
/* https://groups.google.com/forum/#!original/comp.lang.c/qFv18ql_WlU/IK8KGZZFJx4J */
static uint64_t kiss64_x = 1234567890987654321ULL;
static uint64_t kiss64_c = 123456123456123456ULL;
static uint64_t kiss64_y = 362436362436362436ULL;
static uint64_t kiss64_z = 1066149217761810ULL;
static uint64_t kiss64_t;
#define MWC64 (kiss64_t = (kiss64_x << 58) + kiss64_c, \
kiss64_c = (kiss64_x >> 6), kiss64_x += kiss64_t, \
kiss64_c += (kiss64_x < kiss64_t), kiss64_x)
#define XSH64 (kiss64_y ^= (kiss64_y << 13), kiss64_y ^= (kiss64_y >> 17), \
kiss64_y ^= (kiss64_y << 43))
#define CNG64 (kiss64_z = 6906969069ULL * kiss64_z + 1234567ULL)
#define KISS64 (MWC64 + XSH64 + CNG64)
int main (void)
{
    uint64_t res, ref, a, b, count = 0;
    printf ("Testing %s's algo\n", TEST_HUEFFNER_ALGO ? "Hueffner" : "Juffa");
    do {
        a = KISS64;
        b = KISS64;
        ref = paddsb_ref (a, b);
#if TEST_HUEFFNER_ALGO
        res = hpaddsb (a, b);
#else // TEST_HUEFFNER_ALGO
        res = epaddsb (a, b);
#endif // TEST_HUEFFNER_ALGO
        if (res != ref) {
            printf ("error @ a=%016llx b=%016llx: res=%016llx ref=%016llx\n",
                    a, b, res, ref);
            return EXIT_FAILURE;
        }
        count++;
    } while (count < NBR_TEST_CASES);
    printf ("test passed\n");
    return EXIT_SUCCESS;
}
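For reference, a passing run with TEST_HUEFFNER_ALGO left at 0 simply prints "Testing Juffa's algo" followed by "test passed" once all one billion random test cases have been checked.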
Upvotes: 1
Reputation: 93082
Use the paddsb instruction to add vectors of bytes with signed saturation. The implementation could look like this (assuming the AMD64 SysV ABI):
movq %rdi, %mm0 # move the first operand to an MMX register
movq %rsi, %mm1 # move the second operand to an MMX register
paddsb %mm1, %mm0 # packed add bytes with signed saturation
movq %mm0, %rax # move the result back to a scalar register
emms # end MMX mode
ret # return to caller
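If C with intrinsics is acceptable instead of hand-written assembly, the same paddsb operation is exposed through the MMX intrinsics in mmintrin.h. A sketch of what that could look like (my addition, not from the original answer; the name paddsb_mmx is made up, and a compiler that supports the 64-bit MMX conversion intrinsics is assumed):
#include <stdint.h>
#include <mmintrin.h>

uint64_t paddsb_mmx(uint64_t a, uint64_t b) {
    /* packed add of signed bytes with saturation, i.e. the paddsb instruction */
    __m64 r = _mm_adds_pi8(_mm_cvtsi64_m64((long long)a), _mm_cvtsi64_m64((long long)b));
    uint64_t out = (uint64_t)_mm_cvtm64_si64(r);
    _mm_empty(); /* emms: leave MMX state before any later x87 code runs */
    return out;
}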
Without MMX, the following approach can be used. The idea is to perform the following algorithm on all bytes in parallel with SWAR techniques:
int8_t addsb(int8_t a, int8_t b) {
    int8_t q = a + b;
    /* can the addition overflow? (only if a and b have the same sign) */
    if (((a ^ b) & 0x80) == 0) {
        /* is the result of a different sign than a? */
        if (((a ^ q) & 0x80) != 0) {
            /* if yes, overflow occurred */
            return (a & 0x80 ? 0x80 : 0x7f);
        }
    }
    return (q);
}
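As a point of reference (my own addition, assuming the addsb helper above is in scope; the name paddsb_scalar is made up), applying the helper to each byte lane in a plain loop gives a scalar version of the whole operation, which the branch-free code below aims to compute on all eight lanes at once:
uint64_t paddsb_scalar(uint64_t a, uint64_t b) {
    uint64_t r = 0;
    for (int i = 0; i < 64; i += 8) {
        int8_t ai = (int8_t)(uint8_t)(a >> i);       /* extract one byte lane per operand */
        int8_t bi = (int8_t)(uint8_t)(b >> i);
        r |= (uint64_t)(uint8_t)addsb(ai, bi) << i;  /* place the saturated sum back into its lane */
    }
    return r;
}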
The following code is untested but should work:
paddsb: mov $0x0101010101010101, %rdx # LSB bit masks
lea (%rsi, %rdi, 1), %rax # q = a + b
mov %rdi, %rcx
xor %rsi, %rcx # a ^ b
mov %rax, %rbx
sub %rcx, %rbx # a + b - (a ^ b) (carry out)
and %rdx, %rbx # carry outs from one byte to the next
not %rcx # ~a ^ b
xor %rax, %rdi # a ^ q
sub %rbx, %rax # compensate for the carry out
and %rcx, %rdi # bit 7 set where overflow
shr $7, %rdi # bit 0 set where overflow
and %rdx, %rdi # 0x01 where overflow, 0x00 where not
imul $0xff, %rdi, %rdi # 0xff where overflow, 0x00 where not
shr $7, %rsi
and %rdx, %rsi # 0x01 where b negative, 0x00 where not
mov $0x7f7f7f7f7f7f7f7f, %rdx
add %rsi, %rdx # 0x80 where b negative, 0x7f where not
and %rdi, %rdx # masked to only where overflown
not %rdi # 0x00 where overflow, 0xff where not
and %rdi, %rax # q masked to only where not overflown
or %rdx, %rax # signed sum of a and b
ret
Note that some extra processing is needed to keep carries from spilling over from one byte into the next.
Upvotes: 3
Reputation: 59263
I wrote it in C++ like this:
#include <cstdint>
uint64_t add(uint64_t a, uint64_t b) {
    uint64_t asigns = a & 0x8080808080808080L;
    uint64_t bsigns = b & 0x8080808080808080L;
    uint64_t sum = (a^asigns) + (b^bsigns);
    // fix up 8 bit wrapped sums
    sum ^= asigns ^ bsigns;
    uint64_t sumsigns = sum & 0x8080808080808080L;
    // we saturate high when a and b were positive, but the result is negative
    uint64_t sat = sumsigns & ~(asigns|bsigns);
    sum |= (sat>>7)*127;
    sum &= ~sat;
    // we saturate negative when a and b were negative, but the result is positive
    sat = (asigns&bsigns) & ~sumsigns;
    sum &= ~((sat>>7)*127);
    sum |= sat;
    return sum;
}
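To trace the fix-up on the question's first example (a single-byte walk-through added for illustration): with a = 80 = 0x50 and b = 60 = 0x3C, both sign masks are zero, so the masked add gives sum = 0x8C and the sign-bit XOR changes nothing. sumsigns is then 0x80 while asigns|bsigns is zero, so sat = 0x80: ORing in (sat>>7)*127 = 0x7F turns the byte into 0xFF, and clearing the sat bit leaves 0x7F = 127, the expected saturated result.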
Then I went over to https://godbolt.org/ to see what various compilers generate. clang-16 gives 33 instructions:
add(unsigned long, unsigned long):
movabs rdx, -9187201950435737472
mov rax, rdi
and rax, rdx
mov rcx, rsi
and rcx, rdx
movabs r8, 9187201950435737471
mov r9, rdi
and r9, r8
and r8, rsi
add r8, r9
xor rax, rcx
xor rax, r8
or rsi, rdi
not rsi
and rdx, rsi
and rdx, r8
mov rsi, rdx
shr rsi, 7
mov r8, rdx
sub r8, rsi
or r8, rax
xor r8, rdx
not rax
and rcx, rdi
and rcx, rax
mov rdx, rcx
shr rdx, 7
mov rax, rcx
sub rax, rdx
not rax
and rax, r8
or rax, rcx
ret
You can try the various other compilers and options there yourself.
Upvotes: 3