Reputation: 26975
I have the following bottleneck function.
typedef unsigned char byte;
void CompareArrays(const byte * p1Start, const byte * p1End, const byte * p2, byte * p3)
{
const byte b1 = 128-30;
const byte b2 = 128+30;
for (const byte * p1 = p1Start; p1 != p1End; ++p1, ++p2, ++p3) {
*p3 = (*p1 < *p2 ) ? b1 : b2;
}
}
I want to replace C++
code with SSE2 intinsic functions. I have tried _mm_cmpgt_epi8
but it used signed compare. I need unsigned compare.
Is there any trick (SSE, SSE2, SSSE3) to solve my problem?
Note: I do not want to use multi-threading in this case.
Upvotes: 6
Views: 1377
Reputation: 1
Unfortunately, many of the answers above are incorrect. Let's assume a 3-bit word:
unsigned: 4 5 6 7 0 1 2 3 == signed: -4 -3 -2 -1 0 1 2 3 (bits: 100 101 110 111 000 001 010 011)
The method by Paul R is incorrect. Suppose we want to know if 3 > 2. min(3,2) == 2, which suggests yes, so the method works here. Now suppose we want to know if if 7 > 2. The value 7 is -1 in signed representation, so min(-1,2) == -1, which suggests wrongly that 7 is not greater than 2 unsigned.
The method by Andrey is also incorrect. Suppose we want to know if 7 > 2, or a = 7, and b = 2. The value 7 is -1 in signed representation, so the first term (a > b) fails, and the method suggests that 7 is not greater than 2.
However, the method by BJobnh, as corrected by Alexey, is correct. Just subtract 2^(n-1) from the values, where n is the number of bits. In this case, we would subtract 4 to obtain new corresponding values:
old signed: -4 -3 -2 -1 0 1 2 3 => new signed: 0 1 2 3 -4 -3 -2 -1 == new unsigned 0 1 2 3 4 5 6 7.
In other words, unsigned_greater_than(a,b) is equivalent to signed_greater_than(a - 2^(n-1), b - 2^(n-1)).
Upvotes: -1
Reputation: 212929
Instead of offsetting your signed values to make them unsigned, a slightly more efficient way would be to do the following:
_mm_min_epu8
to get the unsigned min of p1, p2_mm_cmpeq_epi8
_mm_or_si128
and _mm_andc_si128
to select the appropriate b1/b2 valuesNote that this is 4 instructions in total, compared with 5 using the offset + signed comparison approach.
Upvotes: 9
Reputation: 41374
Yes, this can be done in SIMD, but it will take a few steps to make the mask.
Ruslik got it right, I think. You want to xor each component with 0x80 to flip the sense of the signed and unsigned comparison. _mm_xor_si128 (PXOR
) gets you that -- you'll need to create the mask as a static char array somewhere before loading it into a SIMD register. Then _mm_cmpgt_epi8 gets you a mask and you can use a bitwise AND (eg _mm_and_si128
) to perform a masked-move.
Upvotes: 2
Reputation: 64203
You can subtract 127 from your numbers, and then use _mm_cmpgt_epi8
Upvotes: 2
Reputation: 43311
Yes, SSE will not work here. You can improve this code performance on multi-core computer by using OpenMP:
void CompareArrays(const byte * p1Start, const byte * p1End, const byte * p2, byte * p3) { const byte b1 = 128-30; const byte b2 = 128+30; int n = p1End - p1Start; #pragma omp parallel for for (int i = 0; i < n; ++p1, ++i) { p3[i] = (p1[i] < p2[i]) ? b1 : b2; } }
Upvotes: 1