Peemer

Reputation: 49

How to access a class from one cuda kernel in the next kernel

I have a device pointer, dev_NN, and I allocate space for it on the device using the size of my class Neu (declared in a header):

Neu *dev_NN;
cudaStatus = cudaMalloc((void**)&dev_NN, sizeof(Neu));

Then I call a kernel which initialises the class on the GPU.

KGNN<<<1, threadsPerBlock>>>(dev_LaySze, dev_NN);

Inside the kernel:

__global__ void KGNN(int * dev_LaySze, Neu * NN)
{
    ...
    NN = Neu(dev_LaySze[0], dev_LaySze[1], dev_LaySze[2]);
}  

After this kernel returns, I want to use another kernel to feed input data to the class methods and retrieve output data (the allocations and copies are already done and work), such as

__global__ void KGFF(double *dev_inp,  double *dev_outp, int *DataSize)
{
    int i = threadIdx.x;
    ...
    NN.Analyse(dev_inp, dev_outp, DataSize );
}

The second kernel knows nothing about the class that was created, so, as you would expect, NN is unrecognised. How do I access the first NN without re-creating and re-initialising the class? The second kernel has to be called several times and must remember the changes it made to the class variables on earlier calls. I only want to use the class on the GPU, not the CPU, and I don't want to pass it back and forth each time.

Upvotes: 1

Views: 594

Answers (1)

Robert Crovella

Reputation: 151879

I don't think this has anything to do with CUDA, actually. I believe a similar problem would be observed if you tried this in ordinary C++ (assuming the pointer to NN is not a global variable).

The key aspect of the solution, as pointed out by Park Young-Bae, is simply to pass the pointer to the space allocated for NN to both kernels. I think a few other changes are also needed in what you have shown, based on my understanding of what you are trying to do (you haven't posted complete code). Here's a fully worked example:

$ cat t635.cu
#include <stdio.h>

class MC {

  int md;
  public:
  __host__ __device__ int get_md() { return md;}
  __host__ __device__ MC(int val) { md = val; }
};

__global__ void kernel1(MC *d){

  *d = MC(3);
}

__global__ void kernel2(MC *d){

  printf("val = %d\n", d->get_md());
}

int main(){

  MC *d_obj;
  cudaMalloc(&d_obj, sizeof(MC));
  kernel1<<<1,1>>>(d_obj);
  kernel2<<<1,1>>>(d_obj);
  cudaDeviceSynchronize();
  return 0;
}
$ nvcc -arch=sm_20 -o t635 t635.cu
$ ./t635
val = 3
$

The other changes I suggest:

  1. In your first kernel, you're passing a pointer (NN), which presumably refers to a device allocation you have already made, and then you create an object and copy it into that allocated space. To do that, the pointer must be dereferenced on the left-hand side:

    *NN = Neu(dev_LaySze[0], dev_LaySze[1], dev_LaySze[2]);
    
  2. In your second kernel, NN must also be passed in as a parameter. Since it is a pointer, members have to be accessed with ->:

    NN->Analyse(dev_inp, dev_outp, DataSize );
    

I have made those two changes to my posted example. Again, I think this is all just C++ mechanics, not anything specific to CUDA.

Upvotes: 1
