Unexpected CPU utilization with OpenCL

Question

I've written a simple OpenCL kernel to calculate the cross-correlation of two images on the GPU. However, when I execute the kernel with enqueueNDRangeKernel, the CPU usage of one core rises to 100%, even though the host code does nothing except wait for the enqueued command to finish. Is this normal behavior for an OpenCL program? What is going on here?

OpenCL kernel (if relevant):

kernel void cross_correlation(global double *f,
                              global double *g,
                              global double *res) {
  // This work item will compute the cross-correlation value for pixel w
  const int2 w = (int2)(get_global_id(0), get_global_id(1));

  // Main loop
  int xy_index = 0;
  int xy_plus_w_index = w.x + w.y * X;

  double integral = 0;
  for ( int y = 0; y + w.y < Y; ++y ) {
    for ( int x = 0; x + w.x < X; ++x, ++xy_index, ++xy_plus_w_index ) {
      // xy_index is equal to x + y * X
      // xy_plus_w_index is equal to (x + w.x) + (y + w.y) * X
      integral += f[xy_index] * g[xy_plus_w_index];
    }

    xy_index += w.x;
    xy_plus_w_index += w.x;
  }

  res[w.x + w.y * X] = integral;
}

The images f, g, res have a size of X times Y pixels, where X and Y are set at compile time. I'm testing the above kernel with X = 2048 and Y = 2048.
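To check the GPU output, the same sum can be written as a plain host-side loop. This is a minimal sketch, assuming row-major images stored in std::vector and X, Y passed at run time instead of compile time. The function name cross_correlation_reference is only illustrative and is not part of the original program.

```cpp
#include <cstddef>
#include <vector>

// Host-side reference for the kernel above (hypothetical helper, not from the
// original program). For every offset (wx, wy) it computes
//   res[wx + wy * X] = sum over x, y of f[x + y * X] * g[(x + wx) + (y + wy) * X]
// restricted to positions where the shifted pixel stays inside the image.
std::vector<double> cross_correlation_reference(const std::vector<double>& f,
                                                const std::vector<double>& g,
                                                std::size_t X, std::size_t Y) {
  std::vector<double> res(X * Y, 0.0);
  for (std::size_t wy = 0; wy < Y; ++wy) {
    for (std::size_t wx = 0; wx < X; ++wx) {
      double integral = 0.0;
      // Same loop bounds as the kernel: stop once the shifted index leaves the image.
      for (std::size_t y = 0; y + wy < Y; ++y) {
        for (std::size_t x = 0; x + wx < X; ++x) {
          integral += f[x + y * X] * g[(x + wx) + (y + wy) * X];
        }
      }
      res[wx + wy * X] = integral;
    }
  }
  return res;
}
```

The cost grows with X² · Y², so this is only practical for small test images (for example 16 x 16), not for the 2048 x 2048 case.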

Additional info: I am running the kernel on an NVIDIA GPU with OpenCL version 1.2. The C++ program is written using the OpenCL C++ Wrapper API and executed on Debian using optirun from the Bumblebee package.

As requested, here is a minimal working example:

#include <CL/cl.hpp>

#include <fstream>
#include <sstream>

using namespace std;

int main ( int argc, char **argv ) {
  const int X = 2048;
  const int Y = 2048;

  // Create context
  cl::Context context ( CL_DEVICE_TYPE_GPU );

  // Read kernel from file
  ifstream kernel_file ( "cross_correlation.cl" );
  stringstream buffer;
  buffer << kernel_file.rdbuf ( );
  string kernel_code = buffer.str ( );

  // Build kernel
  cl::Program::Sources sources;
  sources.push_back ( { kernel_code.c_str ( ), kernel_code.length ( ) } );
  cl::Program program ( context, sources );
  program.build ( " -DX=2048 -DY=2048" );

  // Allocate buffer memory
  cl::Buffer fbuf ( context, CL_MEM_READ_WRITE, X * Y * sizeof(double) );
  cl::Buffer gbuf ( context, CL_MEM_READ_WRITE, X * Y * sizeof(double) );
  cl::Buffer resbuf ( context, CL_MEM_WRITE_ONLY, X * Y * sizeof(double) );

  // Create command queue
  cl::CommandQueue queue ( context );

  // Create kernel
  cl::Kernel kernel ( program, "cross_correlation" );

  kernel.setArg ( 0, fbuf );
  kernel.setArg ( 1, gbuf );
  kernel.setArg ( 2, resbuf );

  // Set input arguments
  double *f = new double[X*Y];
  double *g = new double[X*Y];

  for ( int i = 0; i < X * Y; i++ )
    f[i] = g[i] = 0.001 * i;

  queue.enqueueWriteBuffer ( fbuf, CL_TRUE, 0, X * Y * sizeof(double), f );
  queue.enqueueWriteBuffer ( gbuf, CL_TRUE, 0, X * Y * sizeof(double), g );

  // Execute kernel
  queue.enqueueNDRangeKernel ( kernel, cl::NullRange, cl::NDRange ( X, Y ), cl::NullRange, NULL, NULL );
  queue.finish ( );

  // Release host buffers
  delete[] f;
  delete[] g;

  return 0;
}
