Reputation: 26975

How to optimize a cycle?

I have the following bottleneck function.

typedef unsigned char byte;
void CompareArrays(const byte * p1Start, const byte * p1End, const byte * p2, byte * p3)
{
     const byte b1 = 128-30;
     const byte b2 = 128+30;
     for (const byte * p1 = p1Start; p1 != p1End; ++p1, ++p2, ++p3) {
        *p3 = (*p1 < *p2 ) ? b1 : b2;
    }
}

I want to replace C++ code with SSE2 intinsic functions. I have tried _mm_cmpgt_epi8 but it used signed compare. I need unsigned compare.

Is there any trick (SSE, SSE2, SSSE3) to solve my problem?

Note: I do not want to use multi-threading in this case.

Upvotes: 6

Answers (6)

T Wu

Reputation: 1

Unfortunately, many of the answers above are incorrect. Let's assume a 3-bit word:

unsigned: 4 5 6 7 0 1 2 3 == signed: -4 -3 -2 -1 0 1 2 3 (bits: 100 101 110 111 000 001 010 011)

The method by Paul R is incorrect. Suppose we want to know if 3 > 2. min(3,2) == 2, which suggests yes, so the method works here. Now suppose we want to know if if 7 > 2. The value 7 is -1 in signed representation, so min(-1,2) == -1, which suggests wrongly that 7 is not greater than 2 unsigned.

The method by Andrey is also incorrect. Suppose we want to know if 7 > 2, or a = 7, and b = 2. The value 7 is -1 in signed representation, so the first term (a > b) fails, and the method suggests that 7 is not greater than 2.

However, the method by BJobnh, as corrected by Alexey, is correct. Just subtract 2^(n-1) from the values, where n is the number of bits. In this case, we would subtract 4 to obtain new corresponding values:

old signed: -4 -3 -2 -1 0 1 2 3 => new signed: 0 1 2 3 -4 -3 -2 -1 == new unsigned 0 1 2 3 4 5 6 7.

In other words, unsigned_greater_than(a,b) is equivalent to signed_greater_than(a - 2^(n-1), b - 2^(n-1)).

Upvotes: -1

Paul R

Reputation: 212929

Instead of offsetting your signed values to make them unsigned, a slightly more efficient way would be to do the following:

use _mm_min_epu8 to get the unsigned min of p1, p2
compare this min for equality with p2 using _mm_cmpeq_epi8
the resulting mask will now be 0x00 for elements where p1 < p2 and 0xff for elements where p1 >= p2
you can now use this mask with _mm_or_si128 and _mm_andc_si128 to select the appropriate b1/b2 values

Note that this is 4 instructions in total, compared with 5 using the offset + signed comparison approach.

Upvotes: 9

Crashworks

Reputation: 41374

Yes, this can be done in SIMD, but it will take a few steps to make the mask.

Ruslik got it right, I think. You want to xor each component with 0x80 to flip the sense of the signed and unsigned comparison. _mm_xor_si128 (PXOR) gets you that -- you'll need to create the mask as a static char array somewhere before loading it into a SIMD register. Then _mm_cmpgt_epi8 gets you a mask and you can use a bitwise AND (eg _mm_and_si128) to perform a masked-move.

Upvotes: 2

BЈовић

Reputation: 64203

You can subtract 127 from your numbers, and then use _mm_cmpgt_epi8

Upvotes: 2

Alex F

Reputation: 43311

Yes, SSE will not work here. You can improve this code performance on multi-core computer by using OpenMP:

void CompareArrays(const byte * p1Start, const byte * p1End, const byte * p2, byte * p3)
{
     const byte b1 = 128-30;
     const byte b2 = 128+30;

     int n = p1End - p1Start;
     #pragma omp parallel for
     for (int i = 0; i < n; ++p1, ++i) 
     {
        p3[i] = (p1[i] < p2[i]) ? b1 : b2;
     }
}

Upvotes: 1

Alexander Solonsky

Reputation: 11

use pcmpeqb and be the Power with you.

Upvotes: -3

How to optimize a cycle?

Answers (6)

Related Questions