drjrm3

Reputation: 4718

Can a class simply be passed to a CUDA kernel for parallel evaluation?

I have a rather complicated C++ class which has functions of the sort myObj.fun(x,y,z). I would like to call this function on a GPU over a 3D grid of points.

My high-level question is: is it a simple matter to pass myObj and a large number of points to the GPU? I have been avoiding creating a CUDA implementation of this function because of the work involved, but it occurs to me that this may be very simple.

Again, this is a very high-level question, so answers like "yes, it's that easy" or "no, it's more complicated" are welcome, though a bit of direction would also be helpful.

Upvotes: 0

Views: 1212

Answers (1)

Robert Crovella

Reputation: 151799

It's possible to use classes/objects on the GPU, including their methods (e.g. fun()). Such classes would at a minimum have to have methods that are decorated with __host__ __device__, but the code refactoring may not be any more involved than that.
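For example, here is a minimal sketch of what that decoration looks like (the class name and method body are just placeholders, not taken from your code; the fully worked example at the end of this answer follows the same pattern):

class myObj_t
{
  double scale;
  public:
    myObj_t(double s) : scale(s) {}
    // decorated so it can be called both from host code and from device (kernel) code
    __host__ __device__ double fun(double x, double y, double z) const {
      return scale*(x + y + z);  // placeholder body
    }
};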

However, such methods (like any other code that hasn't been refactored) will not, by themselves, tap into any of the parallel power of the GPU. The most basic comparison is that a method which runs in a single CPU thread would then run in a single GPU thread. That is normally not faster, and is frequently a lot slower, if all you did was pass a single object to the GPU and run the equivalent single-threaded code there (in a single GPU thread).

One possible strategy is that if you have a great many of these objects (or, equivalently in your case, a great many "points"), each representing independent work, then you can assign each one (object, or point) to a GPU thread and work on them that way, enabling the massively multithreaded operation that GPUs like; the worked example below maps one point per thread in exactly this way. Ideally you would have 10,000 or more points to process like this.

This still usually isn't the most efficient use of the GPU, for a variety of reasons: one has to do with efficient data access, and another has to do with (possible) thread divergence. Nevertheless, some people do pursue this "simple", "embarrassingly parallel" approach to code porting, occasionally with interesting speed-ups.

Depending on your actual code, you might see attractive results if you lay the points out so that adjacent threads access adjacent data, for each operation that touches the points. It's quite reasonable to expect an attractive speed-up in that fashion, perhaps with relatively little code refactoring beyond attention to data organization for optimal GPU access.
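To illustrate that data-organization point, here is a sketch contrasting an array-of-structures layout with a structure-of-arrays layout (the struct and field names are illustrative, not from your code; launch and allocation code is omitted and would look like the full example below):

// array-of-structures: thread idx reads pts[idx].x, pts[idx].y, pts[idx].z,
// so adjacent threads touch memory locations that are strided apart (uncoalesced)
struct point { float x, y, z; };
__global__ void kernel_aos(point *pts, int n){
  int idx = threadIdx.x + blockDim.x*blockIdx.x;
  if (idx < n){ pts[idx].x += 1.0f; pts[idx].y += 1.0f; pts[idx].z += 1.0f; }
}

// structure-of-arrays: adjacent threads read adjacent elements of x[], y[], z[]
// (this is the layout the worked example below uses, with one array per coordinate)
__global__ void kernel_soa(float *x, float *y, float *z, int n){
  int idx = threadIdx.x + blockDim.x*blockIdx.x;
  if (idx < n){ x[idx] += 1.0f; y[idx] += 1.0f; z[idx] += 1.0f; }
}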

Here's a fully worked example:

$ cat t30.cu
#include <iostream>
#include <cstdlib>

const int dsize = 3;
const int nTPB = 256;
const int rng = 8;

class myclass
{

  int increment;
  public:
    myclass(int _incr): increment(_incr) {};
    // methods callable on the device need the __device__ decoration
    __host__ __device__ void fun(int &x, int &y, int &z){
      x += increment;
      y += increment;
      z += increment;}

};

// this is the actual device routine that is run per thread
__global__ void mykernel(myclass obj, int *dx, int *dy, int *dz, int dsize){

  int idx = threadIdx.x+blockDim.x*blockIdx.x; // figure out which thread we are
  if (idx < dsize)
    obj.fun(dx[idx], dy[idx], dz[idx]);  // apply method
}


int main(){

  // allocate host data
  int *p_x, *p_y, *p_z, *d_x, *d_y, *d_z;
  p_x = new int[dsize];
  p_y = new int[dsize];
  p_z = new int[dsize];

  // allocate device data
  cudaMalloc(&d_x, dsize*sizeof(int));
  cudaMalloc(&d_y, dsize*sizeof(int));
  cudaMalloc(&d_z, dsize*sizeof(int));

  // initialize host data
  std::cout << "Before:" << std::endl;
  for (int i = 0; i < dsize; i++){
    p_x[i] = rand()%rng;
    p_y[i] = rand()%rng;
    p_z[i] = rand()%rng;
    std::cout << p_x[i] << "," << p_y[i] << "," << p_z[i] << std::endl;}

  // copy to device
  cudaMemcpy(d_x, p_x, dsize*sizeof(int), cudaMemcpyHostToDevice);
  cudaMemcpy(d_y, p_y, dsize*sizeof(int), cudaMemcpyHostToDevice);
  cudaMemcpy(d_z, p_z, dsize*sizeof(int), cudaMemcpyHostToDevice);

  // instantiate object on host
  myclass test(1);

  // copy object to device as kernel parameter
  mykernel<<<(dsize+nTPB-1)/nTPB, nTPB>>>(test, d_x, d_y, d_z, dsize);

  // copy data back to host
  cudaMemcpy(p_x, d_x, dsize*sizeof(int), cudaMemcpyDeviceToHost);
  cudaMemcpy(p_y, d_y, dsize*sizeof(int), cudaMemcpyDeviceToHost);
  cudaMemcpy(p_z, d_z, dsize*sizeof(int), cudaMemcpyDeviceToHost);


  std::cout << "After:" << std::endl;
  for (int i = 0; i < dsize; i++){
    std::cout << p_x[i] << "," << p_y[i] << "," << p_z[i] << std::endl;}

  return 0;
}
$ nvcc -o t30 t30.cu
$ ./t30
Before:
7,6,1
3,1,7
2,4,1
After:
8,7,2
4,2,8
3,5,2
$

For brevity of presentation, I've omitted proper CUDA error checking, but I would always recommend you use it when you are developing CUDA code.
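One common pattern is to wrap each runtime API call in a checking macro and to check explicitly for kernel launch and execution errors; something like the following sketch works (the macro name is arbitrary, not part of the CUDA toolkit):

#include <cstdio>
#include <cstdlib>

#define cudaCheck(call) do {                                        \
    cudaError_t err = (call);                                       \
    if (err != cudaSuccess) {                                       \
      fprintf(stderr, "CUDA error: %s at %s:%d\n",                  \
              cudaGetErrorString(err), __FILE__, __LINE__);         \
      exit(EXIT_FAILURE); } } while (0)

  // usage inside main() of the example above:
  cudaCheck(cudaMemcpy(d_x, p_x, dsize*sizeof(int), cudaMemcpyHostToDevice));
  mykernel<<<(dsize+nTPB-1)/nTPB, nTPB>>>(test, d_x, d_y, d_z, dsize);
  cudaCheck(cudaGetLastError());       // catches launch/configuration errors
  cudaCheck(cudaDeviceSynchronize());  // catches errors during kernel execution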

Upvotes: 4
