Gaslight Deceive Subvert

Reputation: 20418

OpenCL kernel communication using volatile memory

I'm trying to get two OpenCL kernels to communicate with each other. A worker kernel runs a loop and a control kernel feeds it jobs and tells it when it's done. I'm using a volatile device buffer for communication. It works on the Intel OpenCL 2.1 platform, but on the Nvidia OpenCL 3.0 CUDA (Quadro P400) platform the program hangs. It appears that the worker kernel loops forever.

MWE below. The PLAT_IDX define can be 0 or 1 and selects between the Intel and Nvidia platforms:

#include <assert.h>
#include <stdio.h>
#define CL_TARGET_OPENCL_VERSION 300
#include <CL/cl.h>
#define PLAT_IDX 0
#define CHK(x)      assert((x) == CL_SUCCESS)

const char *PROGRAM = (
    "__kernel void loop(volatile __global uint *buf) {"
    "   while (buf[0] != 123) { buf[1]++; }"
    "}"
    "__kernel void post(volatile __global uint *buf) {"
    "   buf[0] = 123;"
    "}"
);

int main(int argc, char *argv[]) {
    cl_uint n_platforms, n_devices;
    CHK(clGetPlatformIDs(0, NULL, &n_platforms));
    cl_platform_id plats[2];
    CHK(clGetPlatformIDs(n_platforms, plats, NULL));
    CHK(clGetDeviceIDs(plats[PLAT_IDX], CL_DEVICE_TYPE_ALL, 0, NULL, &n_devices));
    cl_device_id dev;
    CHK(clGetDeviceIDs(plats[PLAT_IDX], CL_DEVICE_TYPE_ALL, 1, &dev, NULL));
    assert(n_platforms > 0 && n_devices > 0);

    cl_int err;
    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, &err);
    CHK(err);
    cl_command_queue_properties props[] = {
        CL_QUEUE_PROPERTIES,
        CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, 0
    };
    cl_command_queue queue = clCreateCommandQueueWithProperties(
        ctx, dev, props, &err);
    CHK(err);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &PROGRAM, NULL, &err);
    CHK(err);
    CHK(clBuildProgram(prog, 1, &dev, NULL, NULL, NULL));
    cl_kernel loop = clCreateKernel(prog, "loop", &err);
    CHK(err);
    cl_kernel post = clCreateKernel(prog, "post", &err);
    CHK(err);
    cl_mem mem = clCreateBuffer(
        ctx, CL_MEM_READ_WRITE, 2 * sizeof(cl_uint), NULL, &err);
    CHK(err);
    cl_event evs[2];
    CHK(clSetKernelArg(loop, 0, sizeof(cl_mem), &mem));
    CHK(clSetKernelArg(post, 0, sizeof(cl_mem), &mem));
    CHK(clEnqueueNDRangeKernel(
            queue, loop, 1, NULL, (size_t[]){1},
            NULL, 0, NULL, &evs[0]));
    CHK(clEnqueueNDRangeKernel(
            queue, post, 1, NULL, (size_t[]){1},
            NULL, 0, NULL, &evs[1]));
    printf("Waiting for kernels\n");
    CHK(clWaitForEvents(2, evs));
    return 0;
}

Maybe there is a better way to accomplish this? OpenCL 2.0 has pipes, but very few devices support them.

Upvotes: 0

Views: 157

Answers (2)

Jan-Gerd

Reputation: 1289

From the documentation:

If the CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE property of a command-queue is set, then there is no guarantee that kernel A will finish before kernel B starts execution.

Note that this does not guarantee that kernels A and B will run concurrently; it merely allows the implementation to do so. Your device may still choose to run kernel A to completion first.

There is no way to force two kernels to run in parallel. However, you can often use events to the same effect: Schedule both kernels as you already do, do your preparation in kernel A, attach an event to its completion, then wait for that event when you schedule kernel B. While B is running, you can use the same event to re-schedule kernel A from your host code. See this presentation for details: http://people.cs.bris.ac.uk/~simonm/workshops/BSC_2013/opencl:course:bsc/Slides/OpenCL_events.pdf
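A minimal sketch of that event-chaining pattern, reusing the `queue`, `loop`, and `post` variables from the question's MWE (this is a fragment, not a complete program, and assumes the kernels have been rewritten so that `post` prepares a job and `loop` consumes it without spin-waiting):

```c
/* Sketch: chain kernels with an event instead of spin-waiting on memory. */
cl_event done;

/* Run the preparation kernel and capture its completion event. */
CHK(clEnqueueNDRangeKernel(queue, post, 1, NULL, (size_t[]){1},
                           NULL, 0, NULL, &done));

/* The second kernel starts only after `done` fires, even on an
 * out-of-order queue, because it lists the event as a dependency. */
CHK(clEnqueueNDRangeKernel(queue, loop, 1, NULL, (size_t[]){1},
                           NULL, 1, &done, NULL));

/* From the host, you can wait on `done` and re-enqueue the
 * preparation kernel to feed the next batch of work. */
CHK(clWaitForEvents(1, &done));
CHK(clReleaseEvent(done));
```

The key point is that the dependency is expressed through the `event_wait_list` argument of `clEnqueueNDRangeKernel`, so the runtime orders the kernels for you rather than one kernel polling device memory.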

Upvotes: 1

ProjectPhysX

Reputation: 5754

You cannot have two OpenCL kernels communicate, and the volatile keyword doesn't enable that either. Kernels are placed in a queue and executed one after the other. Only kernels in separate queues might execute at the same time, and even then there is no guarantee.

Try to solve it without requiring kernel communication. Have one kernel finish and store results in global memory, and then the other kernel work with the results of the first kernel.
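As a sketch of that split, the question's kernel source could be restructured into two stages that hand results over through global memory (the kernel names and the work done in each stage are illustrative):

```c
/* Sketch: two sequential kernels instead of one spin-waiting kernel.
 * Stage 1 writes its result to global memory; stage 2 reads it. */
const char *PROGRAM = (
    "__kernel void produce(__global uint *buf) {"
    "   buf[1] = 42;"              /* stage 1 stores its result */
    "}"
    "__kernel void consume(__global uint *buf) {"
    "   buf[0] = buf[1] + 1;"      /* stage 2 uses stage 1's result */
    "}"
);
```

On an in-order queue (the default, i.e. without CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE), `produce` is guaranteed to finish before `consume` starts, so no polling or volatile qualifier is needed.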


As a side note, you can communicate across threads within a single kernel. Threads in the same workgroup can exchange data through local memory; threads in different workgroups can still coordinate through atomics on global memory.
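A small illustrative kernel showing both mechanisms (the kernel name and the data flow are made up for the example; it assumes `scratch` is a `__local` buffer sized to the work-group):

```c
/* Sketch: intra-kernel communication. Work-items in one work-group
 * exchange data via __local memory, separated by a barrier; any
 * work-item can also update a shared global counter atomically. */
const char *COMM_KERNEL = (
    "__kernel void comm(__global uint *out,"
    "                   __local uint *scratch,"
    "                   volatile __global uint *counter) {"
    "   uint lid = get_local_id(0);"
    "   scratch[lid] = lid * 2;"
    /* Barrier makes every work-item's write visible to the group. */
    "   barrier(CLK_LOCAL_MEM_FENCE);"
    /* Read a neighbouring work-item's value. */
    "   out[get_global_id(0)] = scratch[(lid + 1) % get_local_size(0)];"
    /* Cross-work-group coordination via a global atomic. */
    "   atomic_inc(counter);"
    "}"
);
```

Note that the barrier only synchronizes within one work-group; the atomic increment works across work-groups but gives no ordering guarantees beyond the atomicity of each update.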

Upvotes: 2
