Reputation: 1840
I'm using a compact struct of 2 unsigned shorts indicating a start and end position.
I need to be able to quickly determine if there are any Range
objects with a length (difference from start to end) past a threshold value.
I'm going to have a huge quantity of objects each with their own Range
array, so it is not feasible to track which Range
objects are above the threshold in a list or something. This code is also going to be run very often (many times a second for each array), so it needs to be efficient.
struct Range
{
unsigned short start;
unsigned short end;
}
I will always have an array of Range
sized 2^n. While I would like to abort as soon as I find something over the threshold, I'm pretty sure it'd be faster to simply OR it all together and check at the end... assuming I can vectorize the loop. Although if I could do an if statement on the chunk of results for each vector, that would be grand.
size_t rangecount = 1 << resolution;
Range* ranges = new Range[rangecount];
...
bool result = false;
for (size_t i = 0; i < ranges; ++i)
{
result |= (range[i].end - range[i].start) > 4;
}
Not surprisingly, the auto-vectorizer gives the 1202 error because my data type isn't 32 or 64 bits wide. I really don't want to double my data size and make each field an unsigned int. So I'm guessing the auto-vectorizer approach is out for this.
Are there vector instructions that can handle 16 bit variables? If there are, how could I use them in c++ to vectorize my loop?
Upvotes: 6
Views: 200
Reputation: 283773
You are wondering if any value is greater than 4?
Yes, there are SIMD instructions for this. It's unfortunate that the auto-vectorized isn't able to handle this scenario. Here's a vectorized algorithm:
diff_v = end_v - start_v; // _mm_hsub_epi16
floor_v = max(4_v, diff_v); // _mm_max_epi16
if (floor_v != 4_v) return true; // wide scalar comparison
Use _mm_sub_epi16
with a structure of arrays or _mm_hsub_epi16
with an array of structures.
Actually since start
is stored first in memory, you will be working on start_v - end_v
, so use _mm_min_epi16
and a vector of -4
.
Each SSE3 instruction will perform 8 comparisons at a time. It will still be fastest to return early instead of looping. However, unrolling the loop a bit more may buy you additional speed (pass the first set of results into the packed min/max function to combine them).
So you end up with (approximately):
most_negative = threshold = _mm_set_epi64(0xFCFCFCFCFCFCFCFC); // vectorized -4
loop:
a = load from range;
b = load from range;
diff = _mm_hsub_epi16(a, b);
most_negative = _mm_min_epi16(most_negative, diff);
// unroll by repeating the above four instructions 4 times or so
if (most_negative != threshold) return true;
repeat loop
Upvotes: 1