Eugene Kolesnikov

Reputation: 663

Accessing Class Member in different CUDA kernels

I have a GPU-only class T which I want to create on the GPU, but keep a pointer to it on the CPU so that I can pass that pointer as an argument to different CUDA kernels.

class T
{
public:
    int v;
public:
    __device__ T() { v = 10; }
    __device__ ~T() {}
    __device__ int compute() { return v; }
};

Here are the kernels that I use to create the class instance and to call the compute() function.

__global__ void kernel(T* obj, int* out)
{
    if(blockIdx.x * blockDim.x + threadIdx.x == 0) {
        out[0] = obj->compute(); // no kernel error, but it returns garbage
    }
}

__global__ void cudaAllocateGPUObj(T* obj)
{
    if(blockIdx.x * blockDim.x + threadIdx.x == 0) {
        obj = new T;
        // if I call `out[0] = obj->compute();` here, everything works fine
    }
}

The main function simply allocates memory for a pointer of type T*, which is later passed as the argument to cudaAllocateGPUObj.

int main()
{
    int cpu, *gpu;
    cudaMalloc((void**)&gpu, sizeof(int));
    T* obj;
    cudaMalloc((void**)&obj, sizeof(T*));
    cudaAllocateGPUObj<<<1,1>>>(obj);
    kernel<<<1,1>>>(obj, gpu);
    cudaMemcpy(&cpu, gpu, sizeof(int), cudaMemcpyDeviceToHost);
    cudaDeviceSynchronize();
    printf("cudaMemcpy\nresult: %d\n", cpu);
    return 0;
}

The problem with this code (as noted in the comments in the code) is that when I call out[0] = obj->compute(); inside the cudaAllocateGPUObj kernel and transfer the obtained value to the CPU, everything is correct. But if I try to read the member value in another kernel, it comes back as garbage; however, if I change the return value from the v member to a constant, everything works fine.

Could you please tell me what is wrong with this code?

Upvotes: 1

Views: 1136

Answers (1)

Robert Crovella

Reputation: 151869

When you pass a parameter to a CUDA kernel, it is a pass-by-value mechanism. You have started with a pointer to an object:

T* obj;

then, instead of allocating storage for the object, you allocate storage for another pointer:

cudaMalloc((void**)&obj, sizeof(T*));

so we're headed down the wrong path here. (This is a logical C programming error at this point.) Next, in the allocate kernel, the obj parameter (which now points to some location in GPU memory space) is passed by value:

__global__ void cudaAllocateGPUObj(T* obj)
                                      ^^^ pass-by-value: local copy is made

Now, when you do this:

        obj = new T;

you allocate a new object on the device heap and overwrite the local copy of obj with its address. So of course that works locally, but the copy of obj in the calling environment is not updated with that new address.

One possible method to fix this is to create a proper pointer-to-pointer methodology:

$ cat t5.cu
#include <stdio.h>

class T
{
public:
    int v;
public:
    __device__ T() { v = 10; }
    __device__ ~T() {}
    __device__ int compute() { return v; }
};

__global__ void kernel(T** obj, int* out)
{
    if(blockIdx.x * blockDim.x + threadIdx.x == 0) {
        out[0] = (*obj)->compute(); 
    }
}

__global__ void cudaAllocateGPUObj(T** obj)
{
    if(blockIdx.x * blockDim.x + threadIdx.x == 0) {
        *obj = new T;
    }
}

int main()
{
    int cpu, *gpu;
    cudaMalloc((void**)&gpu, sizeof(int));
    T** obj;
    cudaMalloc(&obj, sizeof(T*));
    cudaAllocateGPUObj<<<1,1>>>(obj);
    kernel<<<1,1>>>(obj, gpu);
    cudaMemcpy(&cpu, gpu, sizeof(int), cudaMemcpyDeviceToHost);
    cudaDeviceSynchronize();
    printf("cudaMemcpy\nresult: %d\n", cpu);
    return 0;
}

$ nvcc -arch=sm_35 -o t5 t5.cu
$ cuda-memcheck ./t5
========= CUDA-MEMCHECK
cudaMemcpy
result: 10
========= ERROR SUMMARY: 0 errors
$
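Two caveats worth adding (a sketch, not part of the tested output above): objects created with device-side `new` come from the device heap, which defaults to 8 MB and can be raised with `cudaDeviceSetLimit(cudaLimitMallocHeapSize, ...)` before the first kernel launch; and such objects should eventually be released with device-side `delete`, e.g. via a small cleanup kernel like this hypothetical `cudaFreeGPUObj`:

```cuda
// Hypothetical cleanup kernel: releases the object allocated with
// device-side `new` in cudaAllocateGPUObj, running the __device__
// destructor. Launch it once after the last kernel that uses obj.
__global__ void cudaFreeGPUObj(T** obj)
{
    if(blockIdx.x * blockDim.x + threadIdx.x == 0) {
        delete *obj;
    }
}
```

Note that the T** slot itself was allocated with cudaMalloc, so it still needs a host-side cudaFree(obj) as well.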

Upvotes: 3
