user49626
user49626

Reputation: 87

shortest way to find absolute min. of two number & multiply it with signs of its inputs in AVX

Any hint on how to implement AVX for below C logic without multiplications,

for(int i = 0;i<4096;i++)
{
   out[i] = sign(inp1[i])*sign(inp2[i])*min(abs(inp1[i]), abs(inp2[i])); 
}

// inp1, inp2 & out are 16 bit registers.

Upvotes: 2

Views: 248

Answers (2)

chtz
chtz

Reputation: 18807

There is pretty short (but non-obvious) solution to your problem:

res = max(min(a,b), -max(a,b));

(All min/max operations are signed)

To explain why this works, first let us set

A = min(a,b); B = max(a,b);

This essentially sorts a and b (and rules out the case that A>0 && B<0). We now just need to distinguish 3 cases:

A<0  && B<0:     res = -B 
A<0  && B>=0:    res = -min(-A, B) = max(A, -B)
A>=0 && B>=0:    res = A

Fortunately, the first and last case can also be calculated as max(A,-B), since in the first case A < 0 < -B, and in the last case -B <= 0 <= A.

Alternatively, you could just ask (and trust) WolframAlpha. (not really helpful, as it only evaluates to true "assuming a and b are positive" -- you could plot the difference between both expressions though)


Implementing this with AVX2 (ignoring loading and storing):

__m256i A = _mm256_min_epi16(a,b);
__m256i B = _mm256_max_epi16(a,b);
__m256i res = _mm256_max_epi16(A, _mm256_sub_epi16(_mm256_setzero_si256(), B));

The setzero operation will happen outside any loop, so for each packet there are three min/max-operations and one psub-operation. On Intel-CPUs the first execute on ports p01, while psub executes on any p015, so the loop would bottle-neck on p01, requiring 1.5cycles per packet.

As noted by @Soonts, the -B operation can overflow, for B=-0x8000 (there is no positive 0x8000 for signed int16). This only happens for a=b=-0x8000. If you prefer to output 0x7fff in that case, you can replace the subtraction by a saturated subtraction (_mm256_subs_epi16).

Upvotes: 4

Peter Cordes
Peter Cordes

Reputation: 364180

The sign(inp1[i])*sign(inp2[i]) part can almost exactly be implemented with _mm256_sign_epi16(in1, in2), and using that as the 2nd operand to another vpsignw to apply the sign of that to the min(abs,abs) result.

psignw negates or zeroes the first operand, depending on the 2nd operand being negative or zero. (Intrinsics guide). (We don't need the zeroing part of psignw: if either input is zero, unsigned min of their absolute values will be zero. But we have to avoid it depending on how we generate an input, if that can happen when neither of our real inputs are zero.)

There's a corner case where that's wrong: in1 = INT16_MIN = 0x8000, in2<0. The result of negating in1 would still be negative; thanks to the 2's complement most negative number not having an inverse.

If one of the 2 values can't be 0x8000, use that as the 1st arg to _mm256_sign_epi16 with no extra operations needed.

@chtz proposes a workaround strategy: XOR the inputs together to get the right value for the sign bit. But that will trigger vpsignw's zeroing behaviour for in1==in2 because in1^in2==0. You could or with set1(1) on the XOR result to make sure it's non-zero.

// pseudocode because the full intrinsic names are long and hard to read / type
    sign = (in1 ^ in2) | 1;
    out = psignw( min(abs1,abs2), sign);
  // operation count: XOR, OR, PSIGNW = 3 plus min(abs,abs)

On Skylake, vpsignw can run on execution ports p0 or p1. Booleans like vpxor and vpor can run on any of p0, p1, or p5. (https://uops.info/) So this way is potentially better than the other idea which uses psignw twice. It "couples" together the dependency chains of both operands earlier, by 1 instruction, but probably this will be throughput limited even if data is coming from another operation in the same pass.

pabsw and pminuw both also need p0 / p1, can't run on p5, so picking the same number of instructions but using ones that can utilize port 5 leads to a better balance of execution port pressure for the back-end on Skylake. Zen2 is somewhat similar, with booleans able to run on any FP execution port (0/1/2/3) but psignw / pabsw only FP0 / FP3, and pminuw only FP0/1/3.


Another option is to avoid psignw entirely instead of working around its zeroing behaviour: XOR and then broadcast the sign bit with arithmetic right shift, then implement conditional negation with the 2's complement identity -x = ~x - (-1). But that costs one more operation.

    sign = (in1 ^ in2) >> 15;   // pxor  psraw
    out =  (min(abs1,abs2) ^ sign) - sign;  // pxor, psubw
  // operation count: XOR, shift, XOR, SUB = 4 plus min(abs,abs)

Another workaround idea was _mm256_or_si256(in1, _mm256_set1_epi16(1)) before vpsignw to make sure the value has the same sign but isn't INT16_MIN.

// not as good as 
   sign = psignw(in1 | 1, in2);   // VPOR, VPSIGNW
   out = psignw( min(abs1,abs2), sign);
// operation count: OR, 2x PSIGNW = 3 plus min(abs,abs)

An arithmetic right shift by 1 wouldn't be safe: it could make the operand zero when the input was 1, resulting in a final output of zero for an input of 1, 2


IDK if there's any clever trick that would be better than vpabsw on each input separately to feed vpminuw

Upvotes: 3

Related Questions