Reputation: 12174
I'm using the _mm_cmpgt_epi64 intrinsic to implement a 128-bit addition, and later a 256-bit one. Looking at the result of this intrinsic, something puzzles me: I don't understand why the computed mask is the way it is.
const __m128i mask = _mm_cmpgt_epi64(bflip, sumflip);
And here's the output in my debugger:
(lldb) p/x bflip
(__m128i) $1 = (0x00000001, 0x80000000, 0x00000000, 0x80000000)
(lldb) p/x sumflip
(__m128i) $2 = (0x00000000, 0x80000000, 0xffffffff, 0x7fffffff)
(lldb) p/x mask
(__m128i) $3 = (0xffffffff, 0xffffffff, 0x00000000, 0x00000000)
For the first 64-bit lane (bits 63:0) I'm OK. But why isn't the second lane (bits 127:64) all ones too? It seems to me that 0x8000000000000000 > 0x7fffffffffffffff.
Upvotes: 0
Views: 474
Reputation: 364458
It appears you're printing it in 32-bit chunks, not 64-bit, which makes the values harder to read.
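For example, storing the vector to memory and printing two 64-bit values shows the lanes directly. A minimal sketch (print_epi64 is just a hypothetical helper name; low lane printed first):

#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>

/* Print a __m128i as two 64-bit lanes, low lane first. */
static void print_epi64(const char *name, __m128i v) {
    uint64_t lane[2];
    _mm_storeu_si128((__m128i *)lane, v);   /* store both lanes to the array */
    printf("%s = { 0x%016llx, 0x%016llx }\n", name,
           (unsigned long long)lane[0], (unsigned long long)lane[1]);
}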
But anyway, it's a signed two's complement integer compare, as documented in the manual: http://felixcloutier.com/x86/PCMPGTQ.html
0x8000000000000000 is the most negative 64-bit signed integer, while 0x7fffffffffffffff is the largest positive one.
If you want an unsigned compare, you need to range-shift both inputs by flipping their sign bit. Logically this is subtracting 2^63 to go from 0..2^64-1 to -2^63 .. 2^63-1. But we can do it with a more efficient XOR, because XOR is add-without-carry, and the carry/borrow-out goes off the end of the register.
// Flip each lane's sign bit, then a signed compare gives the unsigned result.
const __m128i rangeshift = _mm_set1_epi64x(0x8000000000000000);
const __m128i mask = _mm_cmpgt_epi64(_mm_xor_si128(bflip, rangeshift),
                                     _mm_xor_si128(sumflip, rangeshift));
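As a self-contained check with the values from your debugger output (reassembled from the 32-bit chunks into 64-bit lanes; _mm_cmpgt_epi64 is SSE4.2, so compile with e.g. -msse4.2):

#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    /* _mm_set_epi64x takes the high lane first. */
    const __m128i bflip   = _mm_set_epi64x((int64_t)0x8000000000000000ULL,
                                           (int64_t)0x8000000000000001ULL);
    const __m128i sumflip = _mm_set_epi64x((int64_t)0x7fffffffffffffffULL,
                                           (int64_t)0x8000000000000000ULL);

    const __m128i rangeshift = _mm_set1_epi64x((int64_t)0x8000000000000000ULL);
    const __m128i mask = _mm_cmpgt_epi64(_mm_xor_si128(bflip, rangeshift),
                                         _mm_xor_si128(sumflip, rangeshift));

    uint64_t lane[2];
    _mm_storeu_si128((__m128i *)lane, mask);
    printf("lane0: 0x%016llx  lane1: 0x%016llx\n",
           (unsigned long long)lane[0], (unsigned long long)lane[1]);
    /* Both lanes print as all-ones: as unsigned numbers,
       0x8000000000000001 > 0x8000000000000000 and
       0x8000000000000000 > 0x7fffffffffffffff. */
    return 0;
}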
Or use AVX512F's unsigned compares directly: __mmask8 _mm512_cmp[eq|ge|gt|le|lt|neq]_epu64_mask(__m512i a, __m512i b).
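A minimal sketch of that route, assuming an AVX-512F target (the greater_u64 wrapper name is just for illustration; with AVX-512VL the 128-bit _mm_cmpgt_epu64_mask form also exists):

#include <immintrin.h>

/* AVX-512F: unsigned 64-bit compare, no range shift needed.
   The result is a bitmask with one bit per lane, not a vector mask. */
__mmask8 greater_u64(__m512i a, __m512i b) {
    return _mm512_cmpgt_epu64_mask(a, b);   /* bit i set iff a[i] > b[i] unsigned */
}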
Upvotes: 1