Reputation: 15
I am working on a simple parallel reduction to find the minimum value in an array, and I am running into some strange, seemingly undefined behavior. I am running Intel's OpenCL 1.2 on Ubuntu 16.04.
The following kernel currently gives me the wrong answer:
__kernel void Find_Min(int arraySize, __global double* scratch_arr, __global double* value_arr, __global double* min_arr){
    const int index = get_global_id(0);
    int length = (int)sqrt((double)arraySize);
    int start = index*length;
    double min_val = INFINITY;
    for(int i=start; i<start+length && i < arraySize; i++){
        if(value_arr[i] < min_val)
            min_val = value_arr[i];
    }
    scratch_arr[index] = min_val;
    barrier(CLK_GLOBAL_MEM_FENCE);
    if(index == 0){
        double totalMin = min_val;
        for(int i=1; i<length; i++){
            if(scratch_arr[i] < totalMin)
                totalMin = scratch_arr[i];
        }
        min_arr[0] = totalMin;
    }
}
When I pass in the array {0,-1,-2,-3,-4,-5,-6,-7,-8}, it returns -2.
Here is where the undefined behavior comes in. When I run the following kernel with a printf statement before the barrier I get the right answer (-8):
__kernel void Find_Min(int arraySize, __global double* scratch_arr, __global double* value_arr, __global double* min_arr){
    const int index = get_global_id(0);
    int length = (int)sqrt((double)arraySize);
    int start = index*length;
    double min_val = INFINITY;
    for(int i=start; i<start+length && i < arraySize; i++){
        if(value_arr[i] < min_val)
            min_val = value_arr[i];
    }
    scratch_arr[index] = min_val;
    printf("setting scratch[%i] to %f\n", index, min_val);
    barrier(CLK_GLOBAL_MEM_FENCE);
    if(index == 0){
        double totalMin = min_val;
        for(int i=1; i<length; i++){
            if(scratch_arr[i] < totalMin)
                totalMin = scratch_arr[i];
        }
        min_arr[0] = totalMin;
    }
}
The only explanation I can think of is that I am using the barrier incorrectly, and that the printf merely delays each work-item long enough that they all finish before the final reduction step. Without the printf, work-item 0 runs the final reduction before the other work-items have finished.
Does anyone else have any suggestions or tips on how to debug this issue?
Thanks in advance!!
Upvotes: 0
Views: 101
Reputation: 15
The problem was that the kernel was being launched with one work-item per work-group, and barriers only synchronize work-items within the same work-group. See this response to a similar question: Open CL no synchronization despite barrier
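For anyone hitting the same thing, here is a minimal sketch of the kernel written so that it depends only on the launch configuration. It assumes the host enqueues it as a single work-group, i.e. the local_work_size passed to clEnqueueNDRangeKernel equals global_work_size (and does not exceed the device's maximum work-group size). Deriving the chunk length from get_global_size(0) instead of sqrt(arraySize) is my own change for illustration, not part of the original kernel.

```c
// Sketch only: correct ONLY when all work-items are in one work-group,
// because barrier() does not synchronize across work-groups.
__kernel void Find_Min(int arraySize,
                       __global double* scratch_arr,
                       __global const double* value_arr,
                       __global double* min_arr)
{
    const int index = (int)get_global_id(0);
    const int count = (int)get_global_size(0);   // total number of work-items

    // Ceiling division so every element falls into some work-item's chunk
    // (assumption: replaces the sqrt-based length of the original).
    const int length = (arraySize + count - 1) / count;
    const int start  = index * length;

    // Phase 1: each work-item finds the minimum of its own chunk.
    double min_val = INFINITY;
    for (int i = start; i < start + length && i < arraySize; i++) {
        if (value_arr[i] < min_val)
            min_val = value_arr[i];
    }
    scratch_arr[index] = min_val;

    // Waits for every work-item in THIS work-group and makes their
    // global-memory writes visible to each other.
    barrier(CLK_GLOBAL_MEM_FENCE);

    // Phase 2: work-item 0 reduces the per-chunk minima.
    if (index == 0) {
        double totalMin = min_val;
        for (int i = 1; i < count; i++) {
            if (scratch_arr[i] < totalMin)
                totalMin = scratch_arr[i];
        }
        min_arr[0] = totalMin;
    }
}
```

If the array is too large for one work-group, the usual alternative is to have each work-group write its own partial minimum and then finish the reduction in a second kernel launch or on the host, since there is no barrier that spans work-groups within a single launch.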
Upvotes: 0