Arktis

Reputation: 3

CUDA - Dynamic Shared Memory with Derived Classes

I've been trying to get some of my other code to run, and I've run into a bit of an issue with dynamic shared memory. According to the documentation (https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#shared), I'm supposed to allocate one array of memory and then typecast pointers to specific locations in that array, like this:

extern __shared__ float array[];

short* array0 = (short*)array; 
float* array1 = (float*)&array0[128];
int*   array2 =   (int*)&array1[64];

However, in my own code this doesn't always work, and I can't quite figure out why.

My basic structure has two classes, A and B, as well as an error-checking macro:

#define cudaCheckError() { \
  cudaError_t err = cudaGetLastError(); \
  if(err != cudaSuccess) { \
    printf("Cuda error: %s:%d: Error code %d, %s\n", __FILE__, __LINE__, err,cudaGetErrorString(err)); \
    exit(1); \
  } \
}

class A {
    public:
    
    __device__ virtual int foo() const = 0;
};

class B : public A {
    public:
    
    __device__ B() {}
    
    __device__ virtual int foo() const override {
        return 1;
    }
};

and my kernel:

__global__
void kernel() {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    
    extern __shared__ int shared[];
    B* b_array = (B *) &shared[0];
    
    if (idx == 0) {
        b_array[0] = B();
        
        printf("%i", b_array[0].foo());
    }
    
    __syncthreads();
    
    return;
}

Invoking that kernel with sufficient shared memory specified (kernel<<<1, 1, 1000>>>) and checking the error code yields Error code 700, an illegal memory access was encountered. Running cuda-memcheck on this also gives an error, although a different one: Error code 719, unspecified launch failure.

Changing the kernel to:

__global__
void kernel() {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    
    extern __shared__ B shared[];
    
    if (idx == 0) {
        shared[0] = B();
        
        printf("%i", shared[0].foo());
    }
    
    __syncthreads();
    
    return;
}

And rerunning gives the expected output without an error.

Is this some sort of issue with derived classes and typecasting in CUDA? I'm not copying objects between host and device, so that shouldn't be an issue. Is it just not possible to cast to an array of objects like I want to do?

Upvotes: 0

Views: 434

Answers (1)

Robert Crovella

Reputation: 152249

Based on my experience, an object copy such as:

b_array[0] = B();

does not copy the virtual function pointer table. Therefore the virtual function pointer table must already be set properly in whatever object you are calling a virtual function on.

This allows for that:

extern __shared__ B shared[];

This does not:

extern __shared__ int shared[];

AFAIK, aspects of this are implementation-specific and not required by the C++ standard.

As a proof point, we can do something like this in your failing kernel:

__global__
void kernel() {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    
    extern __shared__ int shared[];
    B* b_array = (B *) &shared[0];
    
    if (idx == 0) {
        B temp = B();
        memcpy(b_array, &temp, sizeof(B));
        
        printf("%i", b_array[0].foo());
    }
    
    __syncthreads();
    
    return;
}

which will then work. I'm not suggesting this is the right way to code it; I'm simply using it to show that at least one problem here is the handling of the virtual function pointer table. As Jerome Richard points out in the comments, using an underlying int array for type-punning to something else may be illegal, although, as you point out, the CUDA docs seem to suggest this pattern.

We can also construct a host code test case following your failing example:

$ cat t131.cpp
#include <cstdio>

class A {
    public:
     virtual int foo() const = 0;
};

class B : public A {
    public:
     B() {}
     virtual int foo() const override {
        return 3;
    }
};

void k1() {

    int sh1[100];
    B* b_array = (B *) &sh1[0];
    b_array[0] = B();

    printf("k1 %i\n", b_array[0].foo());

    return;
}

int main(){
  k1();
}

$ g++ t131.cpp -o t131
$ ./t131
Segmentation fault (core dumped)
$

Which also fails.

You're welcome to file a bug if you find fault with my description or simply wish this case to be handled.

The exact code matters here, so slight changes to my test cases above may result in working or failing code.

Upvotes: 1
