Reputation: 431
PCMPGTQ doesn't exist on SSE2 and doesn't natively work on unsigned integers. Our goal here is to provide backward-compatible solutions for unsigned 64-bit comparisons so we can include them in the WebAssembly SIMD standard.
This is a sister question to this one for ARMv7+NEON: What is the most efficient way to do SIMD unsigned 64 bit comparison (CMHS) on ARMv7 with NEON?
And it is related to the questions for the signed comparison variants already answered for SSE2 and NEON:
How to simulate pcmpgtq on sse2?
What is the most efficient way to support CMGT with 64bit signed comparisons on ARMv7a with Neon?
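For reference, a minimal scalar sketch (my own illustration, not from the linked questions) of the per-lane semantics being emulated: each 64-bit lane should become all-ones when a > b as unsigned, all-zero otherwise, just like PCMPGTQ does for signed lanes.
#include <stdint.h>

// Scalar model of one lane of the desired unsigned 64-bit compare-greater.
static uint64_t scalar_cmpgt_epu64(uint64_t a, uint64_t b) {
    return (a > b) ? UINT64_MAX : 0; // all-ones / all-zero mask, like a SIMD compare result
}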
Upvotes: 2
Views: 989
Reputation: 1185
The carry out from a subtract is an unsigned comparison predicate.
We can capture the carry out using the same trick that is used when averaging two numbers. Only here we'll base it on a half-subtractor instead of a half-adder.
__m128i sse2_cmpgt_epu64(__m128i a, __m128i b) {
    b = _mm_xor_si128(b, a); // diff: the half-subtractor "difference" bits of b - a
    a = _mm_and_si128(a, b); // borrow bits `a & ~b` of b - a (b already holds a ^ b here)
    b = _mm_sub_epi64(_mm_srli_epi64(b, 1), a); // (diff >> 1) - borrow: MSB = borrow out of b - a, i.e. a > b
    return _mm_shuffle_epi32(_mm_srai_epi32(b, 31), _MM_SHUFFLE(3,3,1,1)); // broadcast the MSB to each whole qword
}
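As a sanity check, here is a scalar sketch (my own, not part of the original answer) of why that captures the borrow: b - a has difference bits a ^ b and borrow bits a & ~b, so subtracting the borrows from the right-shifted difference puts the would-be 65th bit (the borrow out, i.e. a > b) into bit 63.
#include <stdint.h>

// Scalar model of the half-subtractor trick: returns 1 iff a > b (unsigned).
static int scalar_borrow_out(uint64_t a, uint64_t b) {
    uint64_t diff   = a ^ b;           // difference bits of b - a
    uint64_t borrow = a & ~b;          // borrow bits of b - a
    uint64_t t = (diff >> 1) - borrow; // floor((b - a) / 2); the borrow out lands in bit 63
    return (int)(t >> 63);             // equals (a > b)
}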
Translated from Hacker's Delight:
static __m128i sse2_cmpgt_epu64(__m128i a, __m128i b) {
    __m128i r = _mm_andnot_si128(_mm_xor_si128(b, a), _mm_sub_epi64(b, a)); // same MSBs: take b - a
    r = _mm_or_si128(r, _mm_andnot_si128(b, a)); // mixed MSBs: ~b & a has MSB(a); its MSB is zero when the MSBs match
    return _mm_shuffle_epi32(_mm_srai_epi32(r, 31), _MM_SHUFFLE(3,3,1,1)); // broadcast the MSB to each whole qword
}
Concept: if the "signs" (unsigned MSBs) are mixed, return a; else return b - a:
(MSB(a) ^ MSB(b)) ? a : b - a; // result in MSB
This makes sense:
- If a's MSB is set and b's isn't, a is unsigned above (so MSB(a) is our result).
- If b's MSB is set and a's isn't, a is unsigned below (so MSB(a) is our result).
- If the MSBs are equal, b-a is effectively a 63-bit subtraction. The MSBs will cancel, and the MSB of b-a will be equal to the "borrow" output, which tells you if a is strictly above b. (Like the CF flag for scalar sub; jb is jc.) So MSB(b-a) is our result.
Note that the SIMD andnot/and/or is a bit-blend, but we only care about the MSB. We broadcast it with srai -> shuffle_epi32, discarding the garbage in lower bits. (Or with SSE3, movshdup as described in @Soont's answer.)
It differs from signed compare:
(MSB(a) ^ MSB(b)) ? ~a : b - a; // result in MSB
If the signs are mixed, then the sign of ~a is also the sign of b, of course.
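For contrast, here is a minimal sketch of that signed blend in the same style (my own illustration, not the code from the linked signed-compare question):
#include <emmintrin.h> // SSE2

// Sketch: signed 64-bit a > b via the blend (MSB(a) ^ MSB(b)) ? ~a : b - a.
static __m128i sse2_cmpgt_epi64_sketch(__m128i a, __m128i b) {
    __m128i diff = _mm_xor_si128(b, a); // MSB set where the signs differ
    __m128i r = _mm_andnot_si128(diff, _mm_sub_epi64(b, a)); // same signs: take b - a
    r = _mm_or_si128(r, _mm_andnot_si128(a, diff)); // mixed signs: take ~a (only the MSB matters)
    return _mm_shuffle_epi32(_mm_srai_epi32(r, 31), _MM_SHUFFLE(3,3,1,1)); // broadcast each qword's MSB
}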
Upvotes: 4
Reputation: 21936
Here you go.
__m128i cmpgt_epu64_sse2( __m128i a, __m128i b )
{
// Compare uint32_t lanes for a > b and a < b
const __m128i signBits = _mm_set1_epi32( 0x80000000 );
a = _mm_xor_si128( a, signBits );
b = _mm_xor_si128( b, signBits );
__m128i gt = _mm_cmpgt_epi32( a, b );
__m128i lt = _mm_cmpgt_epi32( b, a );
// The result we're after equals ( gt > lt ) when these vectors are viewed as uint64_t lanes:
// the high-dword comparison dominates, and if the high dwords are equal it falls through to the low dwords.
// Unlike the source numbers, the lt and gt vectors contain a single bit of information per 32-bit lane.
// This way it's much easier to compare them with SSE2.
// Clear the highest bit to avoid overflows of _mm_sub_epi64.
// _mm_srli_epi32 by any number of bits in [ 1 .. 31 ] would work too, only slightly slower.
gt = _mm_andnot_si128( signBits, gt );
lt = _mm_andnot_si128( signBits, lt );
// Subtract 64-bit integers; we're after the sign bit of the result.
// ( gt > lt ) is equal to extractSignBit( lt - gt )
// The above is only true when ( lt - gt ) does not overflow, that's why we can't use it on the source numbers.
__m128i res = _mm_sub_epi64( lt, gt );
// Arithmetic shift to broadcast the sign bit into higher halves of uint64_t lanes
res = _mm_srai_epi32( res, 31 );
// Broadcast higher 32-bit lanes into the final result.
return _mm_shuffle_epi32( res, _MM_SHUFFLE( 3, 3, 1, 1 ) );
}
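A scalar sketch (my own, not part of the answer) of the reduction those comments describe: pack the per-dword comparison masks into 64-bit numbers, then compare those.
#include <stdint.h>

// Scalar model: a > b (64-bit unsigned) == (gt > lt), where gt/lt hold the per-dword
// comparison masks packed into 64 bits, so the high-dword result dominates.
static int scalar_gt_via_dwords(uint64_t a, uint64_t b) {
    uint32_t a_lo = (uint32_t)a, a_hi = (uint32_t)(a >> 32);
    uint32_t b_lo = (uint32_t)b, b_hi = (uint32_t)(b >> 32);
    uint64_t gt = ((uint64_t)(a_hi > b_hi ? 0xFFFFFFFFu : 0) << 32) | (a_lo > b_lo ? 0xFFFFFFFFu : 0);
    uint64_t lt = ((uint64_t)(b_hi > a_hi ? 0xFFFFFFFFu : 0) << 32) | (b_lo > a_lo ? 0xFFFFFFFFu : 0);
    return gt > lt; // matches (a > b)
}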
If SSE3 is available, movshdup is also a good option instead of the pshufd (_mm_shuffle_epi32) to copy the srai result to the low dword in each element. (Or optimize that away if the next use is movmskpd or something else which only depends on the high part of each qword.)
For example, on Conroe/Merom (first-gen Core 2: SSSE3, with most SIMD execution units 128 bits wide but a limited shuffle unit), pshufd is 2 uops with 3 cycle latency (flt->int domain). movshdup is only 1 uop with 1 cycle latency because its hard-wired shuffle stays within each 64-bit half of a register. movshdup runs in the "SIMD-int" domain, so it doesn't cause any extra bypass latency between the integer shift and whatever integer thing you do next, unlike pshufd. (https://agner.org/optimize/)
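A minimal sketch of that movshdup tail (the helper name is my own); it keeps the srai but uses movshdup instead of pshufd for the broadcast:
#include <pmmintrin.h> // SSE3 (includes SSE2)

// Broadcast the sign bit of each 64-bit lane, using movshdup instead of pshufd.
static __m128i broadcast_qword_signbits_sse3( __m128i res )
{
    res = _mm_srai_epi32( res, 31 ); // sign bit -> whole high dword of each qword
    // movshdup copies the high (odd) 32-bit element of each 64-bit half into the low one too
    return _mm_castps_si128( _mm_movehdup_ps( _mm_castsi128_ps( res ) ) );
}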
If you're JITing, you'd only ever be using this on CPUs without SSE4.2, which means Intel before Nehalem, AMD before Bulldozer. Note that psubq (_mm_sub_epi64) is somewhat slower than the narrower psub sizes on some of those CPUs, but it's still the best option.
For completeness, here's an SSSE3 version (not quite the same thing as SSE3); it saves a few instructions at the cost of a constant load. The only way to find out whether it's faster or slower is to test on old computers.
__m128i cmpgt_epu64_ssse3( __m128i a, __m128i b )
{
// Compare uint32_t lanes for a > b and a < b
const __m128i signBits = _mm_set1_epi32( 0x80000000 );
a = _mm_xor_si128( a, signBits );
b = _mm_xor_si128( b, signBits );
__m128i gt = _mm_cmpgt_epi32( a, b );
__m128i lt = _mm_cmpgt_epi32( b, a );
// Shuffle bytes so each vector becomes [ X, X, Y, Y ]: X packs the low- and high-dword comparison
// results of the first uint64_t lane into one uint32_t, and Y does the same for the second lane.
// Duplicating the values means the final 32-bit compare already fills both halves of each uint64_t lane.
const __m128i shuffleIndices = _mm_setr_epi8(
0, 4, -1, -1,
0, 4, -1, -1,
8, 12, -1, -1,
8, 12, -1, -1 );
gt = _mm_shuffle_epi8( gt, shuffleIndices );
lt = _mm_shuffle_epi8( lt, shuffleIndices );
// Make the result
return _mm_cmpgt_epi32( gt, lt );
}
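A small usage sketch (the main function is my own, assuming either function above is in scope) showing how a caller might consume the all-ones/all-zero lanes, e.g. with movmskpd as mentioned earlier:
#include <emmintrin.h> // SSE2
#include <stdint.h>
#include <stdio.h>

int main( void )
{
    // Lane 0: a > b should be true; lane 1: a > b should be false.
    __m128i a = _mm_set_epi64x( 1, (long long)0xFFFFFFFF00000000ull );
    __m128i b = _mm_set_epi64x( 2, 1 );
    __m128i r = cmpgt_epu64_sse2( a, b ); // or cmpgt_epu64_ssse3 on SSSE3 hardware
    // Each qword lane is all-ones or all-zero, so movmskpd packs one bit per lane.
    int mask = _mm_movemask_pd( _mm_castsi128_pd( r ) );
    printf( "mask = %d\n", mask ); // expect 1: bit 0 set (lane 0 true), bit 1 clear (lane 1 false)
    return 0;
}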
Upvotes: 2