Reputation: 17530
I have been reading the following note: http://developer.amd.com/resources/documentation-articles/articles-whitepapers/opencl-optimization-case-study-simple-reductions/
The following kernel should reduce a chunk of data, and there's one part of it I simply don't understand.
while (global_index < length) .... global_index += get_global_size(0)
I believed it was smarter to read data from global memory that is laid out sequentially, meaning that reading data at k, k+1, k+2 is faster than reading k+1000, k+2000, k+3000. Is this not what they are doing when writing global_index += get_global_size(0)?
__kernel
void reduce(__global float* buffer,
            __local float* scratch,
            __const int length,
            __global float* result) {
  int global_index = get_global_id(0);
  float accumulator = INFINITY;
  // Loop sequentially over chunks of input vector
  while (global_index < length) {
    float element = buffer[global_index];
    accumulator = (accumulator < element) ? accumulator : element;
    global_index += get_global_size(0);
  }
  // Perform parallel reduction
  int local_index = get_local_id(0);
  scratch[local_index] = accumulator;
  barrier(CLK_LOCAL_MEM_FENCE);
  for (int offset = get_local_size(0) / 2;
       offset > 0;
       offset = offset / 2) {
    if (local_index < offset) {
      float other = scratch[local_index + offset];
      float mine = scratch[local_index];
      scratch[local_index] = (mine < other) ? mine : other;
    }
    barrier(CLK_LOCAL_MEM_FENCE);
  }
  if (local_index == 0) {
    result[get_group_id(0)] = scratch[0];
  }
}
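To check my understanding of what the kernel computes (not how the hardware executes it), here is a serial Python sketch of the two phases: every work item takes a grid-stride pass over the buffer, then each work group tree-reduces its work items' partial minima. The function name and parameters are my own, not from the article.

```python
import math

def reduce_min(buffer, global_size, local_size):
    """Host-side simulation of the kernel: a grid-stride minimum per
    work item, then a tree reduction within each work group."""
    length = len(buffer)
    # Phase 1: grid-stride loop, one partial minimum per work item.
    acc = [math.inf] * global_size
    for gid in range(global_size):
        i = gid
        while i < length:
            acc[gid] = min(acc[gid], buffer[i])
            i += global_size
    # Phase 2: tree reduction within each work group (the "scratch" array).
    results = []
    for group in range(global_size // local_size):
        scratch = acc[group * local_size:(group + 1) * local_size]
        offset = local_size // 2
        while offset > 0:
            for lid in range(offset):
                scratch[lid] = min(scratch[lid], scratch[lid + offset])
            offset //= 2
        results.append(scratch[0])
    return results

# 4 work items in 2 groups of 2, reducing 8 elements:
print(reduce_min([5.0, 3.0, 8.0, 1.0, 9.0, 2.0, 7.0, 4.0], 4, 2))
# → [2.0, 1.0]  (one partial minimum per work group)
```

As in the kernel, the host still has to take the minimum of the per-group results to get the final answer.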
Upvotes: 0
Views: 795
Reputation: 9886
Work items 0,1,2,3,... will first read buffer indices 0,1,2,3,... in parallel (consecutive addresses, which is generally the best case for memory access), and then indices 1000,1001,1002,1003,... in parallel (assuming a global size of 1000 as in your example), and so on.
Remember that each instruction in the kernel code is executed "in parallel" by all work items, so the stride applies across loop iterations of the whole group, not to a single work item reading alone.
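A small sketch may make this concrete: for each pass of the while loop, list the buffer indices the work items read simultaneously. The helper function is hypothetical, just to illustrate the access pattern.

```python
def addresses_per_iteration(length, global_size):
    """For each pass of the kernel's while loop, return the buffer
    indices read simultaneously by the work items (illustrative only)."""
    passes = []
    base = 0
    while base < length:
        passes.append([base + gid for gid in range(global_size)
                       if base + gid < length])
        base += global_size
    return passes

# With 4 work items and a 10-element buffer:
for p in addresses_per_iteration(10, 4):
    print(p)
# → [0, 1, 2, 3] then [4, 5, 6, 7] then [8, 9]
```

Within each pass the addresses are consecutive, which is exactly the coalesced pattern you were hoping for; the large stride only appears between passes of the same work item.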
Upvotes: 1