acorso

OpenCL undefined behavior in parallel reduction algorithm

I am working on a simple parallel reduction algorithm to find the minimum value in an array and am coming across some interesting undefined behavior in my algorithm. I am running Intel's OpenCL 1.2 on Ubuntu 16.04.

The following kernel is what I am trying to run which is currently giving me the wrong answer:

__kernel void Find_Min(int arraySize, __global double* scratch_arr, __global double* value_arr, __global double* min_arr){

    const int index = get_global_id(0);
    int length = (int)sqrt((double)arraySize);
    int start = index*length;
    double min_val = INFINITY;
    for(int i=start; i<start+length && i < arraySize; i++){
        if(value_arr[i] < min_val)
            min_val = value_arr[i];
    }
    scratch_arr[index] = min_val;

    barrier(CLK_GLOBAL_MEM_FENCE);
    if(index == 0){
        double totalMin = min_val;
        for(int i=1; i<length; i++){
            if(scratch_arr[i] < totalMin)
                totalMin = scratch_arr[i];
        }
        min_arr[0] = totalMin;
    }
}

When I pass in the array {0,-1,-2,-3,-4,-5,-6,-7,-8}, the kernel returns -2.

Here is where the undefined behavior comes in. When I run the following kernel with a printf statement before the barrier I get the right answer (-8):

__kernel void Find_Min(int arraySize, __global double* scratch_arr, __global double* value_arr, __global double* min_arr){

    const int index = get_global_id(0);
    int length = (int)sqrt((double)arraySize);
    int start = index*length;
    double min_val = INFINITY;
    for(int i=start; i<start+length && i < arraySize; i++){
        if(value_arr[i] < min_val)
            min_val = value_arr[i];
    }
    scratch_arr[index] = min_val;
    printf("setting scratch[%i] to %f\n", index, min_val);

    barrier(CLK_GLOBAL_MEM_FENCE);
    if(index == 0){
        double totalMin = min_val;
        for(int i=1; i<length; i++){
            if(scratch_arr[i] < totalMin)
                totalMin = scratch_arr[i];
        }
        min_arr[0] = totalMin;
    }
}

The only explanation I can think of is that I am using the barrier incorrectly, and that the printf merely delays each work-item long enough for all of them to finish writing before the final reduction step. Without the printf, work-item 0 runs the final reduction before the other work-items have finished.

Does anyone else have any suggestions or tips on how to debug this issue?

Thanks in advance!!

Answers (1)

acorso

The problem was that the kernel was being launched with one work-item per work-group, and barrier() only synchronizes work-items within the same work-group. See this response to a similar question: "Open CL no synchronization despite barrier".
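
To make the ordering requirement concrete, here is a minimal host-side sketch in plain C (not OpenCL, and not the original kernel) that models the two phases as separate passes. The names `chunk_length`, `partial_min`, and `find_min_two_phase` are made up for illustration, and the chunk count assumes one work-item per chunk of `length` elements, which the question does not state explicitly. For the array in the question, the first chunk's minimum is -2, which matches the wrong result and is consistent with work-item 0 reading `scratch_arr` before the other work-items had written to it.

```c
#include <math.h>    /* INFINITY and NAN macros only; no libm calls */
#include <stdlib.h>  /* malloc, free */

/* Integer floor(sqrt(n)); stands in for (int)sqrt((double)arraySize). */
static int chunk_length(int n) {
    int len = 0;
    while ((long)(len + 1) * (len + 1) <= n) {
        len++;
    }
    return len;
}

/* Phase 1: the work one work-item does before the barrier. */
static double partial_min(const double *values, int n, int index, int length) {
    double min_val = INFINITY;
    int start = index * length;
    for (int i = start; i < start + length && i < n; i++) {
        if (values[i] < min_val) {
            min_val = values[i];
        }
    }
    return min_val;
}

/* Runs phase 1 for every chunk, and only then phase 2 over the scratch
 * buffer. Finishing phase 1 completely before phase 2 starts is the
 * ordering the kernel needs from its barrier.
 * Assumption: one (hypothetical) work-item per chunk. */
double find_min_two_phase(const double *values, int n) {
    if (n <= 0) {
        return INFINITY;
    }
    int length = chunk_length(n);
    int chunks = (n + length - 1) / length;

    double *scratch = malloc((size_t)chunks * sizeof *scratch);
    if (scratch == NULL) {
        return NAN;  /* allocation failed */
    }

    /* Phase 1: every partial minimum is written first. */
    for (int index = 0; index < chunks; index++) {
        scratch[index] = partial_min(values, n, index, length);
    }

    /* Phase 2: the final reduction reads only completed results. */
    double total_min = scratch[0];
    for (int i = 1; i < chunks; i++) {
        if (scratch[i] < total_min) {
            total_min = scratch[i];
        }
    }

    free(scratch);
    return total_min;
}
```

In OpenCL, as far as I understand it, this ordering holds only if all work-items are in a single work-group (so the barrier covers every one of them), or if the final reduction runs in a second kernel enqueued after the first on an in-order command queue.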
