user173342
user173342

Reputation: 1840

Vectorizing a conditional involving shorts

I'm using a compact struct of 2 unsigned shorts indicating a start and end position.
I need to be able to quickly determine if there are any Range objects with a length (difference from start to end) past a threshold value.

I'm going to have a huge quantity of objects each with their own Range array, so it is not feasible to track which Range objects are above the threshold in a list or something. This code is also going to be run very often (many times a second for each array), so it needs to be efficient.

struct Range
{
 unsigned short start;
 unsigned short end;
}

I will always have an array of Range sized 2^n. While I would like to abort as soon as I find something over the threshold, I'm pretty sure it'd be faster to simply OR it all together and check at the end... assuming I can vectorize the loop. Although if I could do an if statement on the chunk of results for each vector, that would be grand.

size_t rangecount = 1 << resolution;
Range* ranges = new Range[rangecount];

...

bool result = false;
for (size_t i = 0; i < ranges; ++i)
{
 result |= (range[i].end - range[i].start) > 4;
}

Not surprisingly, the auto-vectorizer gives the 1202 error because my data type isn't 32 or 64 bits wide. I really don't want to double my data size and make each field an unsigned int. So I'm guessing the auto-vectorizer approach is out for this.

Are there vector instructions that can handle 16 bit variables? If there are, how could I use them in c++ to vectorize my loop?

Upvotes: 6

Views: 200

Answers (1)

Ben Voigt
Ben Voigt

Reputation: 283773

You are wondering if any value is greater than 4?

Yes, there are SIMD instructions for this. It's unfortunate that the auto-vectorized isn't able to handle this scenario. Here's a vectorized algorithm:

diff_v = end_v - start_v; // _mm_hsub_epi16 
floor_v = max(4_v, diff_v); // _mm_max_epi16 
if (floor_v != 4_v) return true; // wide scalar comparison

Use _mm_sub_epi16 with a structure of arrays or _mm_hsub_epi16 with an array of structures.

Actually since start is stored first in memory, you will be working on start_v - end_v, so use _mm_min_epi16 and a vector of -4.

Each SSE3 instruction will perform 8 comparisons at a time. It will still be fastest to return early instead of looping. However, unrolling the loop a bit more may buy you additional speed (pass the first set of results into the packed min/max function to combine them).

So you end up with (approximately):

most_negative = threshold = _mm_set_epi64(0xFCFCFCFCFCFCFCFC); // vectorized -4

loop:
    a = load from range;
    b = load from range;
    diff = _mm_hsub_epi16(a, b);
    most_negative = _mm_min_epi16(most_negative, diff);

    // unroll by repeating the above four instructions 4 times or so
    if (most_negative != threshold) return true;
repeat loop

Upvotes: 1

Related Questions