Reputation: 49
I have a device pointer for which I allocate space on the device, sized from a class declared in a header.
Neu *dev_NN;
cudaStatus = cudaMalloc((void**)&dev_NN, sizeof(Neu));
Then I call a kernel which initialises the class on the GPU.
KGNN<<<1, threadsPerBlock>>>(dev_LaySze, dev_NN);
In the kernel:
__global__ void KGNN(int * dev_LaySze, Neu * NN)
{
    ...
    NN = Neu(dev_LaySze[0], dev_LaySze[1], dev_LaySze[2]);
}
After the return of this kernel I want to use another kernel to input data to class methods and retrieve output data (the allocators and copies are already done and work), such as
__global__ void KGFF(double *dev_inp, double *dev_outp, int *DataSize)
{
    int i = threadIdx.x;
    ...
    NN.Analyse(dev_inp, dev_outp, DataSize );
}
The second kernel knows nothing about the class that was created. As you would expect NN is unrecognised. How do I access the first NN without re-creating the class and re-initialising it? The second kernel has to be called several times, remembering the changes it made to the class variables earlier. I don't want to use the class with the CPU, only the GPU, and I don't want to pass it back and forth each time.
Upvotes: 1
Views: 594
Reputation: 151879
I don't think this has anything to do with CUDA, actually. I believe a similar problem would be observed if you tried this in ordinary C++ (assuming the pointer to NN is not a global variable).
The key aspect of the solution, as pointed out by Park Young-Bae, is simply to pass the pointer to the allocated space for NN to both kernels. There were a few other changes that I think needed to be made to what you have shown, according to my understanding of what you are trying to do (since you haven't posted complete code). Here's a fully worked example:
$ cat t635.cu
#include <stdio.h>

class MC {
    int md;
public:
    __host__ __device__ int get_md() { return md;}
    __host__ __device__ MC(int val) { md = val; }
};

__global__ void kernel1(MC *d){
    *d = MC(3);
}

__global__ void kernel2(MC *d){
    printf("val = %d\n", d->get_md());
}

int main(){
    MC *d_obj;
    cudaMalloc(&d_obj, sizeof(MC));
    kernel1<<<1,1>>>(d_obj);
    kernel2<<<1,1>>>(d_obj);
    cudaDeviceSynchronize();
    return 0;
}
$ nvcc -arch=sm_20 -o t635 t635.cu
$ ./t635
val = 3
$
The other changes I suggest:
in your first kernel, you're passing a pointer (NN), which presumably you have made a device allocation for, and then you are creating an object and copying that object to the allocated space. In that case I think you need:
*NN = Neu(dev_LaySze[0], dev_LaySze[1], dev_LaySze[2]);
in your second kernel, if NN is a pointer, we must use:
NN->Analyse(dev_inp, dev_outp, DataSize );
I have made those two changes to my posted example. Again, I think this is all just C++ mechanics, not anything specific to CUDA.
Upvotes: 1