clBuildProgram yields AccessViolationException when building this specific kernel

Question

This is a part of some sort of parallel reduction/extremum kernel. I have reduced it to the minimum code that still gets clBuildProgram crashing (note that it really crashes, and doesn't just return an error code):

EDIT: It seems like this also happens when local_value is declared global instead of local.

EDIT2 / SOLUTION: The problem was that there was an infinite loop. I should have written remaining_items >>= 1 instead of remaining_items >> 1. As has been said in the answers, the nvidia compiler seems not very robust when it comes to compile/optimization errors.

kernel void testkernel(local float *local_value)
{
    size_t thread_id = get_local_id(0);

    int remaining_items = 1024;

    while (remaining_items > 1)
    {
        // throw away the right half of the threads
        remaining_items >> 1; // <-- SPOTTED THE BUG
        if (thread_id > remaining_items)
        {
            return;
        }

        // look for a greater value in the right half of the memory space
        int right_index = thread_id + remaining_items;
        float right_value = local_value[right_index];
        if (right_value > local_value[thread_id])
        {
            local_value[thread_id] = right_value;
        }

        barrier(CLK_GLOBAL_MEM_FENCE);
    }
}

Removing the lines return; and/or local_value[thread_id] = right_value; causes clBuildProgram to finish successfully.

I can reproduce this problem on all of my computers (NVIDIA GTX 560, GT 555M, GT 540M, they're all Fermi 2.1 architecture). It's apparent on the NVIDIA CUDA Toolkit SDK versions 4.0, 4.1 and 4.2, when using either x64 or x86 libraries.

Does anyone have an idea what could be the problem?

Is it possible, that local (aka shared) memory is automatically assumed to be (WORK_GROUP_SIZE) * siezof(its_base_type)? That would explain why it works when the lines I mentioned above are removed.

Minimal host code (C99 compatible) for reproduction:

#include 
#include 
#include 
#include 

#ifdef __APPLE__
#include 
#else
#include 
#endif

#define RETURN_THROW(expression) do { cl_int ret = expression; if (ret) { printf(#expression " FAILED: %d
" , ret); exit(1); } } while (0)
#define REF_THROW(expression) do { cl_int ret; expression; if (ret) { printf(#expression " FAILED: %d
" , ret); exit(1); } } while (0)

int main(int argc, char **argv)
{
    // Load the kernel source code into the array source_str
    FILE *fp;

    fp = fopen("testkernel.cl", "rb");
    if (!fp)
    {
        fprintf(stderr, "Failed to load kernel.
");
        exit(1);
    }
    fseek(fp, 0, SEEK_END);
    int filesize = ftell(fp);
    rewind(fp);
    char *source_str = (char*)calloc(filesize, sizeof(char));
    size_t bytes_read = fread(source_str, 1, filesize, fp);
    source_str[bytes_read] = 0;
    fclose(fp);

    // Get platform information
    cl_uint num_platforms;
    RETURN_THROW(clGetPlatformIDs(0, NULL, &num_platforms));

    cl_platform_id *platform_ids = (cl_platform_id *)calloc(num_platforms, sizeof(cl_platform_id));
    RETURN_THROW(clGetPlatformIDs(num_platforms, platform_ids, NULL));

    cl_device_id selected_device_id = NULL;

    printf("available platforms:
");
    for (cl_uint i = 0; i < num_platforms; i++)
    {
        char platform_name[50];
        RETURN_THROW(clGetPlatformInfo(platform_ids[i], CL_PLATFORM_NAME, 50, platform_name, NULL));
        printf("%s
", platform_name);

        // get devices for this platform
        cl_uint num_devices;
        RETURN_THROW(clGetDeviceIDs(platform_ids[i], CL_DEVICE_TYPE_GPU, 0, NULL, &num_devices));

        cl_device_id *device_ids = (cl_device_id *)calloc(num_devices, sizeof(cl_device_id));
        RETURN_THROW(clGetDeviceIDs(platform_ids[i], CL_DEVICE_TYPE_GPU, num_devices, device_ids, NULL));

        // select first nvidia device
        if (strstr(platform_name, "NVIDIA"))        // ADAPT THIS ACCORDINGLY
        {
            selected_device_id = device_ids[0];
        }
    }

    if (selected_device_id == NULL)
    {
        printf("No NVIDIA device found
");
        exit(1);
    }

    // Create an OpenCL context
    cl_context context;
    REF_THROW(context = clCreateContext(NULL, 1, &selected_device_id, NULL, NULL, &ret));

    // Create a program from the kernel source
    cl_program program;
    REF_THROW(program = clCreateProgramWithSource(context, 1, (const char **)&source_str, NULL, &ret));

    // Build the program
    cl_int ret = clBuildProgram(program, 1, &selected_device_id, NULL, NULL, NULL);
    if (ret)
    {
        printf("BUILD ERROR
");
        // build error - get build log and display it
        size_t build_log_size;
        ret = clGetProgramBuildInfo(program, selected_device_id, CL_PROGRAM_BUILD_LOG, 0, NULL, &build_log_size);
        char *build_log = new char[build_log_size];
        ret = clGetProgramBuildInfo(program, selected_device_id, CL_PROGRAM_BUILD_LOG, build_log_size, build_log, NULL);
        printf("%s
", build_log);
        exit(1);
    }

    printf("build finished successfully
");
    return 0;
}

Grizzly · Accepted Answer

In my experience the nvidia compiler isn't very robust when it comes to handling build errors, so you probably have a compile error somewhere.

I think your problem is indeed the return, or more to the point its combination with barrier. According to the opencl spec about barriers:

All work-items in a work-group executing the kernel on a processor must execute this function before any are allowed to continue execution beyond the barrier. This function must be encountered by all work-items in a work-group executing the kernel.

If barrier is inside a conditional statement, then all work-items must enter the onditional if any work-item enters the conditional statement and executes the barrier.

If barrer is inside a loop, all work-items must execute the barrier for each iteration of the loop before any are allowed to continue execution beyond the barrier.

So I think your problem is probably that a lot of threads would return before getting to the barrier, making this code invalid. Maybe you should try something like this:

kernel void testkernel(local float *local_value) {
    size_t thread_id = get_local_id(0);
    int remaining_items = 1024;
    while (remaining_items > 1) {
        remaining_items >>= 1;// throw away the right half of the threads
        if (thread_id <= remaining_items) {
             // look for a greater value in the right half of the memory space
             int right_index = thread_id + remaining_items;
             float right_value = local_value[right_index];
             if (right_value > local_value[thread_id])
                 local_value[thread_id] = right_value;
        }
        barrier(CLK_GLOBAL_MEM_FENCE);
    }
}

Edit: Furthermore as noted in the comments it needs to be remaining_items>>=1 instead of remaining_items>>1 in order to avoid producing an infinite loop.

clBuildProgram yields AccessViolationException when building this specific kernel

Answers (1)

Related Questions