Reputation:
I've written a simple OpenCL kernel to calculate the cross-correlation of two images on the GPU. However, when I execute the kernel with enqueueNDRangeKernel
the CPU usage of one core rises to 100%, but the host code does nothing except waiting for the enqueued command to finish. Is this normal behavior of an OpenCL program? What is going on there?
OpenCL kernel (if relevant):
kernel void cross_correlation(global double *f,
global double *g,
global double *res) {
// This work item will compute the cross-correlation value for pixel w
const int2 w = (int2)(get_global_id(0), get_global_id(1));
// Main loop
int xy_index = 0;
int xy_plus_w_index = w.x + w.y * X;
double integral = 0;
for ( int y = 0; y + w.y < Y; ++y ) {
for ( int x = 0; x + w.x < X; ++x, ++xy_index, ++xy_plus_w_index ) {
// xy_index is equal to x + y * X
// xy_plus_w_index is equal to (x + w.x) + (y + w.y) * X
integral += f[xy_index] * g[xy_plus_w_index];
}
xy_index += w.x;
xy_plus_w_index += w.x;
}
res[w.x + w.y * X] = integral;
}
The images f, g, res
have a size of X
times Y
pixels, where X
and Y
are set at compile time. I'm testing the above kernel with X = 2048
and Y = 2048
.
Additional info: I am running the kernel on a Nvidia GPU with OpenCL version 1.2. The C++ program is written using the OpenCL C++ Wrapper API and executed on Debian using optirun from the bumblebee package.
As requested, here is a minimal working example:
#include <CL/cl.hpp>
#include <sstream>
#include <fstream>
using namespace std;
int main ( int argc, char **argv ) {
const int X = 2048;
const int Y = 2048;
// Create context
cl::Context context ( CL_DEVICE_TYPE_GPU );
// Read kernel from file
ifstream kernel_file ( "cross_correlation.cl" );
stringstream buffer;
buffer << kernel_file.rdbuf ( );
string kernel_code = buffer.str ( );
// Build kernel
cl::Program::Sources sources;
sources.push_back ( { kernel_code.c_str ( ), kernel_code.length ( ) } );
cl::Program program ( context, sources );
program.build ( " -DX=2048 -DY=2048" );
// Allocate buffer memory
cl::Buffer fbuf ( context, CL_MEM_READ_WRITE, X * Y * sizeof(double) );
cl::Buffer gbuf ( context, CL_MEM_READ_WRITE, X * Y * sizeof(double) );
cl::Buffer resbuf ( context, CL_MEM_WRITE_ONLY, X * Y * sizeof(double) );
// Create command queue
cl::CommandQueue queue ( context );
// Create kernel
cl::Kernel kernel ( program, "cross_correlation" );
kernel.setArg ( 0, fbuf );
kernel.setArg ( 1, gbuf );
kernel.setArg ( 2, resbuf );
// Set input arguments
double *f = new double[X*Y];
double *g = new double[X*Y];
for ( int i = 0; i < X * Y; i++ )
f[i] = g[i] = 0.001 * i;
queue.enqueueWriteBuffer ( fbuf, CL_TRUE, 0, X * Y * sizeof(double), f );
queue.enqueueWriteBuffer ( gbuf, CL_TRUE, 0, X * Y * sizeof(double), g );
// Execute kernel
queue.enqueueNDRangeKernel ( kernel, cl::NullRange, cl::NDRange ( X, Y ), cl::NullRange, NULL, NULL );
queue.finish ( );
return 0;
}
Upvotes: 0
Views: 486
Reputation: 309
You don't say how you call enqueueNDRangeKernel - which is the critical bit. As I understand it, for NVidia, the call is blocking (although I don't think it's part of the standard that it should be so.) You can get around this by having a separate thread invoke enqueueNDRangeKernel and let that thread block on it whilst your other threads continue, and teh blocking thread can signal an event when it completes.
There's a discussion on it here - and it raises some caveats about having multiple calls to the enqueue occurring in parallel.
Upvotes: 0