Reputation: 17530
I have been reading the following note: http://developer.amd.com/resources/documentation-articles/articles-whitepapers/opencl-optimization-case-study-simple-reductions/
The following kernel should reduce a chunk of data, and there's one part of it I simply don't understand.
while (global_index < length) .... global_index += get_global_size(0)
I believed it was smarter to read data from global memory that is laid out sequentially, meaning that reading data at k, k+1, k+2 is faster than reading k+1000, k+2000, k+3000. Is this not what they are doing when writing global_index += get_global_size(0)?
__kernel
void reduce(__global float* buffer,
            __local float* scratch,
            __const int length,
            __global float* result) {
  int global_index = get_global_id(0);
  float accumulator = INFINITY;
  // Loop sequentially over chunks of input vector
  while (global_index < length) {
    float element = buffer[global_index];
    accumulator = (accumulator < element) ? accumulator : element;
    global_index += get_global_size(0);
  }
  // Perform parallel reduction
  int local_index = get_local_id(0);
  scratch[local_index] = accumulator;
  barrier(CLK_LOCAL_MEM_FENCE);
  for (int offset = get_local_size(0) / 2;
       offset > 0;
       offset = offset / 2) {
    if (local_index < offset) {
      float other = scratch[local_index + offset];
      float mine = scratch[local_index];
      scratch[local_index] = (mine < other) ? mine : other;
    }
    barrier(CLK_LOCAL_MEM_FENCE);
  }
  if (local_index == 0) {
    result[get_group_id(0)] = scratch[0];
  }
}
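To check my understanding of what the kernel computes (not how the hardware executes it), here is a serial Python sketch of the two phases: every work item takes a grid-stride pass over the buffer, then each work group tree-reduces its work items' partial minima. The function name and parameters are my own, not from the article.

```python
import math

def reduce_min(buffer, global_size, local_size):
    """Host-side simulation of the kernel: a grid-stride minimum per
    work item, then a tree reduction within each work group."""
    length = len(buffer)
    # Phase 1: grid-stride loop, one partial minimum per work item.
    acc = [math.inf] * global_size
    for gid in range(global_size):
        i = gid
        while i < length:
            acc[gid] = min(acc[gid], buffer[i])
            i += global_size
    # Phase 2: tree reduction within each work group (the "scratch" array).
    results = []
    for group in range(global_size // local_size):
        scratch = acc[group * local_size:(group + 1) * local_size]
        offset = local_size // 2
        while offset > 0:
            for lid in range(offset):
                scratch[lid] = min(scratch[lid], scratch[lid + offset])
            offset //= 2
        results.append(scratch[0])
    return results

# 4 work items in 2 groups of 2, reducing 8 elements:
print(reduce_min([5.0, 3.0, 8.0, 1.0, 9.0, 2.0, 7.0, 4.0], 4, 2))
# → [2.0, 1.0]  (one partial minimum per work group)
```

As in the kernel, the host still has to take the minimum of the per-group results to get the final answer.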
Upvotes: 0
Views: 795
Reputation: 9886
Work items 0,1,2,3,... will first read buffer indices 0,1,2,3,... in parallel (consecutive addresses, which is generally the best case for memory access), and then indices 1000,1001,1002,1003,... in parallel (assuming a global size of 1000 as in your example), and so on.
Remember that each instruction in the kernel code is executed "in parallel" by all work items, so the stride applies across loop iterations of the whole group, not to a single work item reading alone.
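A small sketch may make this concrete: for each pass of the while loop, list the buffer indices the work items read simultaneously. The helper function is hypothetical, just to illustrate the access pattern.

```python
def addresses_per_iteration(length, global_size):
    """For each pass of the kernel's while loop, return the buffer
    indices read simultaneously by the work items (illustrative only)."""
    passes = []
    base = 0
    while base < length:
        passes.append([base + gid for gid in range(global_size)
                       if base + gid < length])
        base += global_size
    return passes

# With 4 work items and a 10-element buffer:
for p in addresses_per_iteration(10, 4):
    print(p)
# → [0, 1, 2, 3] then [4, 5, 6, 7] then [8, 9]
```

Within each pass the addresses are consecutive, which is exactly the coalesced pattern you were hoping for; the large stride only appears between passes of the same work item.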
Upvotes: 1