user9088896

Unexpected CPU utilization with OpenCL

I've written a simple OpenCL kernel to calculate the cross-correlation of two images on the GPU. However, when I execute the kernel with enqueueNDRangeKernel, the CPU usage of one core rises to 100%, even though the host code does nothing except wait for the enqueued command to finish. Is this normal behavior for an OpenCL program? What is going on there?

OpenCL kernel (if relevant):

kernel void cross_correlation(global double *f,
                              global double *g,
                              global double *res) {
  // This work item will compute the cross-correlation value for pixel w
  const int2 w = (int2)(get_global_id(0), get_global_id(1));

  // Main loop
  int xy_index = 0;
  int xy_plus_w_index = w.x + w.y * X;

  double integral = 0;
  for ( int y = 0; y + w.y < Y; ++y ) {
    for ( int x = 0; x + w.x < X; ++x, ++xy_index, ++xy_plus_w_index ) {
      // xy_index is equal to x + y * X
      // xy_plus_w_index is equal to (x + w.x) + (y + w.y) * X
      integral += f[xy_index] * g[xy_plus_w_index];
    }

    xy_index += w.x;
    xy_plus_w_index += w.x;
  }

  res[w.x + w.y * X] = integral;
}

The images f, g, and res are each X by Y pixels, where X and Y are set at compile time. I'm testing the above kernel with X = 2048 and Y = 2048.

Additional info: I am running the kernel on a Nvidia GPU with OpenCL version 1.2. The C++ program is written using the OpenCL C++ Wrapper API and executed on Debian using optirun from the bumblebee package.

As requested, here is a minimal working example:

#include <CL/cl.hpp>

#include <sstream>
#include <fstream>
#include <string>

using namespace std;

int main ( int argc, char **argv ) {
  const int X = 2048;
  const int Y = 2048;

  // Create context
  cl::Context context ( CL_DEVICE_TYPE_GPU );

  // Read kernel from file
  ifstream kernel_file ( "cross_correlation.cl" );
  stringstream buffer;
  buffer << kernel_file.rdbuf ( );
  string kernel_code = buffer.str ( );

  // Build kernel
  cl::Program::Sources sources;
  sources.push_back ( { kernel_code.c_str ( ), kernel_code.length ( ) } );
  cl::Program program ( context, sources );
  program.build ( " -DX=2048 -DY=2048" );

  // Allocate buffer memory
  cl::Buffer fbuf ( context, CL_MEM_READ_WRITE, X * Y * sizeof(double) );
  cl::Buffer gbuf ( context, CL_MEM_READ_WRITE, X * Y * sizeof(double) );
  cl::Buffer resbuf ( context, CL_MEM_WRITE_ONLY, X * Y * sizeof(double) );

  // Create command queue
  cl::CommandQueue queue ( context );

  // Create kernel
  cl::Kernel kernel ( program, "cross_correlation" );

  kernel.setArg ( 0, fbuf );
  kernel.setArg ( 1, gbuf );
  kernel.setArg ( 2, resbuf );

  // Set input arguments
  double *f = new double[X*Y];
  double *g = new double[X*Y];

  for ( int i = 0; i < X * Y; i++ )
    f[i] = g[i] = 0.001 * i;

  queue.enqueueWriteBuffer ( fbuf, CL_TRUE, 0, X * Y * sizeof(double), f );
  queue.enqueueWriteBuffer ( gbuf, CL_TRUE, 0, X * Y * sizeof(double), g );

  // Execute kernel
  queue.enqueueNDRangeKernel ( kernel, cl::NullRange, cl::NDRange ( X, Y ), cl::NullRange, NULL, NULL );
  queue.finish ( );

  delete[] f;
  delete[] g;

  return 0;
}

Upvotes: 0

Views: 486

Answers (1)

Rags

Reputation: 309

You don't say how you call enqueueNDRangeKernel - which is the critical bit. As I understand it, for NVidia the call is blocking (although I don't think it's part of the standard that it should be so). You can get around this by having a separate thread invoke enqueueNDRangeKernel and let that thread block on it whilst your other threads continue; the blocking thread can signal an event when it completes.
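A minimal sketch of that approach, assuming the queue, kernel, X and Y are set up as in the question's example (the helper name run_kernel_async and the 50 ms polling interval are purely illustrative, not part of any API):

#include <CL/cl.hpp>

#include <chrono>
#include <future>
#include <thread>

// Launch the kernel from a worker thread so the blocking enqueue/finish
// pins that thread instead of the main one.
void run_kernel_async ( cl::CommandQueue &queue, cl::Kernel &kernel,
                        int X, int Y ) {
  std::promise<void> launch_done;
  std::future<void> finished = launch_done.get_future ( );

  std::thread worker ( [&] ( ) {
    queue.enqueueNDRangeKernel ( kernel, cl::NullRange,
                                 cl::NDRange ( X, Y ), cl::NullRange );
    queue.finish ( );          // may spin a CPU core, but only on this thread
    launch_done.set_value ( ); // signal completion to the main thread
  } );

  // The main thread polls the future and stays free for other work.
  while ( finished.wait_for ( std::chrono::milliseconds ( 50 ) )
          != std::future_status::ready ) {
    // ... do other host-side work here ...
  }

  worker.join ( );
}

Note that the worker thread still burns a core inside queue.finish(); moving the wait to its own thread only keeps the rest of the host program responsive.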

There's a discussion on it here - and it raises some caveats about having multiple calls to the enqueue occurring in parallel.

Upvotes: 0
