Reputation: 291
I have some problems understanding the two-stage reduction algorithm described here.
__kernel
void reduce(__global float* buffer,
            __local float* scratch,
            __const int length,
            __global float* result) {
  int global_index = get_global_id(0);
  float accumulator = INFINITY;
  // Loop sequentially over chunks of input vector
  while (global_index < length) {
    float element = buffer[global_index];
    accumulator = (accumulator < element) ? accumulator : element;
    global_index += get_global_size(0);
  }
  // Perform parallel reduction
  int local_index = get_local_id(0);
  scratch[local_index] = accumulator;
  barrier(CLK_LOCAL_MEM_FENCE);
  for (int offset = get_local_size(0) / 2; offset > 0; offset /= 2) {
    if (local_index < offset) {
      float other = scratch[local_index + offset];
      float mine = scratch[local_index];
      scratch[local_index] = (mine < other) ? mine : other;
    }
    barrier(CLK_LOCAL_MEM_FENCE);
  }
  if (local_index == 0) {
    result[get_group_id(0)] = scratch[0];
  }
}
I understand the basic idea, but I am not sure about the while loop. As far as I understand, the parameter length specifies the number of elements in the buffer, i.e. how many elements I want to process in total. But get_global_size returns the global number of work items. Aren't length and get_global_size equal then? This would mean that the while loop condition will be satisfied only once. Shouldn't we use get_local_size instead of get_global_size?
Upvotes: 1
Views: 763
Reputation: 1791
[FULL Description]
Overview:
This is a two-stage reduction, which outperforms the recursive multi-stage reduction by reducing synchronizations/barriers and overhead, and by keeping all the compute units as busy as possible. Before studying the kernel, it is important to understand the work-item and work-group configuration set by the host program, and the parameters of the kernel. In this example, the task is to find the minimum of N float numbers. The configuration is given below.
Setup:
The work group configuration is as follows: the host sets up K work items (K < N) and P work groups, so each work group has Q work items, where K = P*Q. It is preferable that N % K == 0, but not necessary.
The parameters and dimensions of the kernel are: 1) the first argument is an array of size N containing the data elements (the candidates for finding the min); 2) the second argument is an empty local array of size Q; 3) the value of length is equal to N; and 4) result is an array of size P.
Workflow: Stage 1
The workflow is as follows: if N % K == 0, each work item initially finds the minimum among N/K data elements, where the elements are K items apart from each other. The while loop does this task. If N % K != 0, some of the work items compute the min of ceil(N/K) elements and the rest compute the min of floor(N/K) elements (as explained in the above answer by Kretab Chabawenizc).
The finding of each work item is initially stored in the private variable accumulator and finally saved into the local array scratch. Once all work items are done with this part of the work (ensured by barrier(CLK_LOCAL_MEM_FENCE)), the kernel starts acting as a recursive parallel reduction kernel. The work items of a given work group treat scratch as their data array, and reduce it iteratively (the for loop does this; read the actual AMD documentation for more explanation).
Finally, the first P elements of result will contain the minimum value found by each of the P work groups.
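To make stage 1 concrete, here is a plain C sketch (not OpenCL; simulate_stage1 and the constants are made up for illustration) that simulates what the kernel does sequentially: each of the K = P*Q work items runs the grid-stride min loop, then each group of Q items performs the tree reduction on its scratch array.

```c
#include <assert.h>
#include <math.h>

#define N 20   /* number of input elements           */
#define P 2    /* number of work groups              */
#define Q 4    /* work items per group (K = P*Q = 8) */

/* Simulates stage 1 of the kernel: writes one partial
   minimum per work group into result[0..P-1]. */
static void simulate_stage1(const float *buffer, int length, float *result) {
    for (int group = 0; group < P; ++group) {
        float scratch[Q];
        /* Grid-stride loop: work item (group, local) visits indices
           gid, gid+K, gid+2K, ... just like the kernel's while loop. */
        for (int local = 0; local < Q; ++local) {
            int gid = group * Q + local;
            float acc = INFINITY;
            for (int i = gid; i < length; i += P * Q)
                acc = (acc < buffer[i]) ? acc : buffer[i];
            scratch[local] = acc;
        }
        /* Tree reduction in "local memory", like the kernel's for loop. */
        for (int offset = Q / 2; offset > 0; offset /= 2)
            for (int local = 0; local < offset; ++local)
                scratch[local] = (scratch[local] < scratch[local + offset])
                                     ? scratch[local] : scratch[local + offset];
        result[group] = scratch[0];
    }
}
```

The sequential loops stand in for what the hardware does in parallel; the per-group barriers of the real kernel are unnecessary here because each group is processed one work item at a time.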
Workflow: Stage 2
Now the second stage starts; in this stage the same kernel can be invoked with P work items and 1 work group. The result array will be the first argument of the kernel this time, and a one-element array will be the last argument, to receive the final result.
In this run the while loop will not do anything significant; it will just copy the values from the buffer to scratch. You could therefore write a more optimized kernel and use that for the second stage.
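To see why the while loop degenerates into a copy in stage 2, here is a small C check (a sketch; stride_iterations mirrors the kernel's while loop) counting how many iterations a work item performs:

```c
#include <assert.h>

/* Mirrors the kernel's grid-stride while loop and returns the
   number of iterations work item `gid` performs, given the global
   size and the input length. */
static int stride_iterations(int gid, int global_size, int length) {
    int count = 0;
    for (int i = gid; i < length; i += global_size)
        ++count;
    return count;
}
```

When the global size equals length, as in stage 2 (P work items over P partial results), every work item iterates exactly once, so the loop only copies one element into scratch.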
Upvotes: 0
Reputation: 37945
Aren't length and get_global_size equal then?
Not necessarily. It is common to launch less work items than there are data elements, and have each work item process more than one element. This way, you can decouple your input data size from the number of work items.
In this case, the following:
// Loop sequentially over chunks of input vector
while (global_index < length) {
  float element = buffer[global_index];
  accumulator = (accumulator < element) ? accumulator : element;
  global_index += get_global_size(0);
}
performs a min-reduction of an array that resides in global memory. Basically, the work groups will "slide" over the input vector, and at every iteration, each work item will update its minimum.
Here's a fictitious numerical example where we launch 2 work groups of 4 work items over an array of 20 elements. xN represents the Nth element of the input array; aN and bN represent the Nth work item of work groups a and b, respectively. Therefore the while condition is met 2 to 3 times, depending on the work item id:
length: 20
get_global_size(): 8
get_local_size(): 4
x0 x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13 x14 x15 x16 x17 x18 x19 Input array
--------------------------------------------------------------------- Iterations
a0 a1 a2 a3 b0 b1 b2 b3 0
a0 a1 a2 a3 b0 b1 b2 b3 1
a0 a1 a2 a3 2
When the while loop finishes, every work item will have computed a minimum over a subset of the input array. For example, a0 will have computed min(x0, x8, x16) and b0 will have computed min(x4, x12).
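The subsets above can be checked with a few lines of C (a sketch; `visited` simply replays the kernel's grid-stride loop with length = 20 and get_global_size() = 8):

```c
#include <assert.h>

/* Collects the indices that work item `gid` visits in the
   grid-stride loop; returns how many were visited. */
static int visited(int gid, int global_size, int length, int *out) {
    int n = 0;
    for (int i = gid; i < length; i += global_size)
        out[n++] = i;
    return n;
}
```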
Then the work items write the minimum they computed to local memory, and the work groups proceed to do a min-reduction (with a reduction tree) in local memory. Their result is written back to global memory, and presumably the kernel is called again with result as the new array to min-reduce, until the final result is a single element.
Upvotes: 3
Reputation: 6343
The global size may be larger than the length because in OpenCL 1.x the global size must be a whole number multiple of the work group size. Therefore the global size might have been rounded up from the data size (length). For example, if length was 1000 but the work group size was 128 then the global size would be 1024.
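The rounding described here is the usual "round up to the next multiple of the work-group size" computation a host program performs; a one-line C version (the function name is illustrative):

```c
#include <assert.h>

/* Rounds `length` up to the next multiple of `wg_size`, as an
   OpenCL 1.x host must do when choosing the global size. */
static int round_up_global_size(int length, int wg_size) {
    return ((length + wg_size - 1) / wg_size) * wg_size;
}
```

With this global size, the kernel's `global_index < length` check keeps the padded work items (here, the extra 24) from reading past the end of the buffer.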
Upvotes: 0