arunmoezhi

Reputation: 3200

Vectorize a loop with indirect access

I have the snippet below, which is a hotspot in an application. The for loop is not vectorized due to a vector dependence. Is there a way to rewrite this loop to make it run faster?

#define  NUM_KEYS  (1L << 20)
#define  NUM_BUCKETS (1L << 10)    
int i,k;
int shift = (1L << 11);
int key_array[NUM_KEYS],key_buff[NUM_KEYS];
int bucket_ptrs[NUM_BUCKETS];

for( i=0; i<NUM_KEYS; i++ )  
{
    k = key_array[i];
    key_buff[bucket_ptrs[k >> shift]++] = k;
}

One approach I tried was to create a temporary array to hold the shifted values of key_array.

int j;
int key_arrays[NUM_KEYS];   /* temporary array of shifted values (bucket indices) */

for( i=0; i<NUM_KEYS; i++ )
{
    key_arrays[i] = key_array[i] >> shift;
}
for( i=0; i<NUM_KEYS; i++ )
{
    k = key_array[i];
    j = key_arrays[i];
    key_buff[bucket_ptrs[j]++] = k;
}

Here the first loop gets vectorized. But overall there is no improvement in performance.

Upvotes: 1

Views: 587

Answers (2)

Mysticial

Reputation: 471229

Why is the loop not vectorizable?

This is because you have non-sequential memory access here:

key_buff[bucket_ptrs[k >> shift]++] = k;

bucket_ptrs determines the index for accessing key_buff. Since these indices are all over the place, the memory access is non-sequential.

Currently, x86 processors only support SIMD loads/stores to contiguous chunks of memory (ideally aligned as well).

If you want it to vectorize, you will need AVX2's gather/scatter instructions. Those don't exist yet, but should be coming out in the next generation of Intel processors.
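For illustration only, here is a minimal sketch of what the gather half could look like using the AVX2 intrinsics Intel has documented (an assumption about future hardware, not something you can run today). Note that it only covers the loads: the stores into key_buff, and the ++ on bucket_ptrs when two keys in the same vector land in the same bucket, have no SIMD equivalent, so the loop body as a whole stays scalar.

#include <immintrin.h>

/* Gather eight bucket pointers for keys key_array[i..i+7].  The shift uses
   the variable-count form so it need not be a compile-time constant, and
   the gather scale of 4 is sizeof(int). */
__m256i gather_bucket_ptrs(const int *key_array, const int *bucket_ptrs,
                           long i, int shift)
{
    __m256i k   = _mm256_loadu_si256((const __m256i *)&key_array[i]);
    __m256i idx = _mm256_srl_epi32(k, _mm_cvtsi32_si128(shift));  /* k >> shift */
    return _mm256_i32gather_epi32(bucket_ptrs, idx, 4);           /* bucket_ptrs[idx] */
}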

Here the first loop gets vectorized. But overall there is no improvement in performance.

This is because you're adding additional loop overhead, and you're now making two passes over key_array. If anything, I'm surprised that it isn't slower.

Is there a way to rewrite this loop to make it run faster?

I doubt it - at least not without altering the algorithm. At the very least, you will want key_buff to fit comfortably into your L1 cache.

AVX2 will let it vectorize, but the problem is that key_buff is 4 MB (2^20 keys × 4 bytes). That won't fit into the lower-level caches. So even AVX2 might not help much. You will be bound completely by memory access.

Upvotes: 3

tmyklebu

Reputation: 14205

What's probably biting you is that, while your accesses to key_array are sequential and bucket_ptrs is small enough to fit in L1, your accesses to key_buff are all over the place.

What you're doing looks like a step in a radix sort, though. If that's actually what you're doing, you might gain performance by first reducing the number of buckets to 32 or 64 or thereabouts, sorting on the most significant five or six bits. Then you've got a whole bunch of little arrays most of which probably fit in cache, and you can sort each of them using another pass of radix sort.
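A rough sketch of that first pass, with hypothetical helper and parameter names (it assumes non-negative keys whose significant width key_bits you know, e.g. 21 for keys below 1<<21): one counting pass sizes the 64 top-bit buckets, a prefix sum turns the counts into starting offsets, and a second pass scatters the keys.

#define TOP_BITS    6
#define TOP_BUCKETS (1 << TOP_BITS)

/* Partition key_array into key_buff by the top TOP_BITS of each key.
   Afterwards, start[b] .. start[b+1] delimit bucket b in key_buff. */
void msb_partition(const int *key_array, int *key_buff, long num_keys,
                   int key_bits, long start[TOP_BUCKETS + 1])
{
    long count[TOP_BUCKETS] = {0};
    long next[TOP_BUCKETS];
    int  top_shift = key_bits - TOP_BITS;
    long i;

    for (i = 0; i < num_keys; i++)              /* pass 1: histogram */
        count[key_array[i] >> top_shift]++;

    start[0] = 0;                               /* counts -> starting offsets */
    for (int b = 0; b < TOP_BUCKETS; b++) {
        start[b + 1] = start[b] + count[b];
        next[b] = start[b];
    }

    for (i = 0; i < num_keys; i++) {            /* pass 2: scatter into buckets */
        int k = key_array[i];
        key_buff[next[k >> top_shift]++] = k;
    }
}

Each bucket key_buff[start[b]..start[b+1]) is then only about NUM_KEYS/64 keys, so the later radix passes over it mostly work on data that fits in cache.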

Upvotes: 2
