Haomin J
Haomin J

Reputation: 41

How to achieve opencl kernel pipeline

I am working on my project using OpenCl. In order to improve the performance of my algorithm, is it possible to pipeline a single kernel? If a kernel consists of many steps, lets say A,B,C, I want A accept new data as soon as it finish its part and pass it to B. I can create channels between them, but I dont know how to do it in detail.

Can I write A,B,C(3 kernels) in a .cl file ? but how to enqueueNDRange? I am using Altera SDK for FPGA HPC development. Thanks.

Upvotes: 2

Views: 2260

Answers (1)

Andrew Savonichev
Andrew Savonichev

Reputation: 699

Pipeline can be achieved by using several kernels connected with channels. With all kernels running concurrently, data is transferred from one to another:

Pipeline example from Intel FPGA OpenCL SDK Programming Guide

Very basic example of such pipeline would be:

channel int foo_bar_channel;
channel float bar_baz_channel;

__kernel void foo(__global int* in) {
  for (int i = 0; i < 1024; ++i) {
    int value = in[i];
    value = clamp(value, 0, 255);                 // do some work
    write_channel_altera(foo_bar_channel, value); // send data to the next kernel
  }
}

__kernel void bar() {
  for (int i = 0; i < 1024; ++i) {
    int value = read_channel_altera(foo_bar_channel); // take data from foo
    float fvalue = (float) value;
    write_channel_altera(bar_baz_channel, value); // send data to the next kernel
  }
}

__kernel void baz(__global int* out) {
  for (int i = 0; i < 1024; ++i) {n
    float value = read_channel_altera(bar_baz_channel);
    float s = sin(value);
    out[i] = s;                                  // write result in the end
  }
}

You can write all kernels in the a single .cl file, or use different files and then #include them into a main .cl file.

We want all our kernels run concurrently, so they can accept data from each other. Since only in-order command queues are supported, we have to use different queue for each kernel:

cl_queue foo_queue = clCreateCommandQueue(...);
cl_queue bar_queue = clCreateCommandQueue(...);
cl_queue baz_queue = clCreateCommandQueue(...);

clEnqueueTask(foo_queue, foo_kernel);
clEnqueueTask(bar_queue, bar_kernel);
clEnqueueTask(baz_queue, baz_kernel);

clFinish(baz_queue); // last kernel in our pipeline

Unlike OpenCL programming for GPU, we rely on a data pipelining, so NDRange kernels would not give us any benefit. Single work-item kernels are used instead of NDRange kernels, so we enqueue them using clEnqueueTask function. Additional kernel attribute (reqd_work_group_size) can be used to mark a single work-item kernel, to give the compiler some room for optimizations.

Check the Intel FPGA SDK for OpenCL Programming Guide for more information about channels and kernel attributes (specifically, section 1.6.4 Implementing the Intel FPGA SDK for OpenCL Channels Extension):

https://www.altera.com/en_US/pdfs/literature/hb/opencl-sdk/aocl_programming_guide.pdf

Upvotes: 4

Related Questions