Reputation: 3200
I have the below snippet which is a hotspot in an application.
The for loop is not vectorized due to a vector dependence.
Is there a way to rewrite this loop to make it run faster?
#define NUM_KEYS (1L << 20)
#define NUM_BUCKETS (1L << 10)
int i, k;
int shift = (1L << 11);
int key_array[NUM_KEYS], key_buff[NUM_KEYS];
int bucket_ptrs[NUM_BUCKETS];
for (i = 0; i < NUM_KEYS; i++)
{
    k = key_array[i];
    key_buff[bucket_ptrs[k >> shift]++] = k;
}
One approach I tried was to create a temporary array to hold the shifted values of key_array.
int key_arrays[NUM_KEYS], j;
for (i = 0; i < NUM_KEYS; i++)
{
    key_arrays[i] = key_array[i] >> shift;
}
for (i = 0; i < NUM_KEYS; i++)
{
    k = key_array[i];
    j = key_arrays[i];
    key_buff[bucket_ptrs[j]++] = k;
}
Here the first loop gets vectorized. But overall there is no improvement in performance.
Upvotes: 1
Views: 587
Reputation: 471229
Why is the loop not vectorizable?
This is because you have non-sequential memory access here:
key_buff[bucket_ptrs[k >> shift]++] = k;
bucket_ptrs determines the index for accessing key_buff. Since these indices are all over the place, the memory access is non-sequential.
Currently, x86 processors only support SIMD loads/stores to contiguous chunks of memory (ideally aligned as well).
If you want it to vectorize, you will need gather/scatter instructions. Those don't exist yet: AVX2, which should be coming out in the next generation of Intel processors, will add gathers, and scatters only arrive later with AVX-512.
Here the first loop gets vectorized. But overall there is no improvement in performance.
This is because you're adding extra loop overhead: you're now making two passes over key_array. If anything, I'm surprised that it isn't slower.
Is there a way to rewrite this loop to make it run faster?
I doubt it - at least not without altering the algorithm. At the very least, you will want key_buff to fit comfortably into your L1 cache.
AVX2 would let it vectorize, but the problem is that key_buff is 4 MB. That won't fit into the lower-level caches, so even AVX2 might not help much: you will be bound completely by memory access.
Upvotes: 3
Reputation: 14205
What's probably biting you is that, while your accesses to key_array are sequential and bucket_ptrs is small enough to fit in L1, your accesses to key_buff are all over the place.
What you're doing looks like a step in a radix sort, though. If that's actually what you're doing, you might gain performance by first reducing the number of buckets to 32 or 64 or thereabouts, sorting on the most significant five or six bits. Then you've got a whole bunch of little arrays most of which probably fit in cache, and you can sort each of them using another pass of radix sort.
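The first level of that idea can be sketched as follows. This is a minimal illustration, not the poster's code: the names (two_level_bucket, NUM_COARSE, KEY_BITS) and the assumption that keys are 20-bit values are mine. A histogram pass counts keys per top-6-bit bucket, a prefix sum turns the counts into start offsets, and the scatter pass then writes to only 64 active regions of key_buff instead of 1024 scattered ones; each resulting region can be bucketed again while it fits in cache.

```c
#include <assert.h>

#define NUM_KEYS   (1L << 20)
#define NUM_COARSE 64          /* 2^6 coarse buckets on the top bits      */
#define KEY_BITS   20          /* assumed width of the key values (mine)  */

static int key_array[NUM_KEYS];
static int key_buff[NUM_KEYS];

void two_level_bucket(void)
{
    int counts[NUM_COARSE] = {0};
    int starts[NUM_COARSE];
    long i;
    int c, sum = 0;

    /* Pass 1: histogram of the top 6 bits (sequential reads, tiny table). */
    for (i = 0; i < NUM_KEYS; i++)
        counts[key_array[i] >> (KEY_BITS - 6)]++;

    /* Prefix sum gives each coarse bucket its start offset in key_buff. */
    for (c = 0; c < NUM_COARSE; c++) {
        starts[c] = sum;
        sum += counts[c];
    }

    /* Pass 2: scatter into 64 contiguous regions; only 64 write streams
       are active at a time instead of 1024 scattered ones. */
    for (i = 0; i < NUM_KEYS; i++) {
        int k = key_array[i];
        key_buff[starts[k >> (KEY_BITS - 6)]++] = k;
    }

    /* Each coarse region now holds ~NUM_KEYS/64 keys and can be bucketed
       or sorted again while it stays in the faster caches. */
}
```

Each coarse region can then be handled independently, which also opens the door to processing them in parallel.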
Upvotes: 2