Reputation: 762
I am trying to allocate shared memory in a CUDA kernel within a templated class:
template <typename T, int Size>
struct SharedArray {
    __device__ T* operator()() {
        __shared__ T x[Size];
        return x;
    }
};
This works as long as no shared memory of the same type and size is requested twice. But when I retrieve shared memory with the same type and size a second time, the second array points to the first one:
__global__
void test() {
    // Shared array
    SharedArray<int, 5> sharedArray;
    int* x0 = sharedArray();
    int* y0 = sharedArray();
    x0[0] = 1;
    y0[0] = 0;
    printf("%i %i\n\n", x0[0], y0[0]);
    // Prints:
    // 0 0
}
One solution is to add an ID to each use of the shared memory class, like:
template <int ID, typename T, int Size>
struct StaticSharedArrayWithID {
    __device__ static T* shared() {
        __shared__ T x[Size];
        return x;
    }
};
But then I have to provide some counter, which makes for a very ugly user interface:
__global__
void test() {
    int* x1 = StaticSharedArrayWithID<__COUNTER__, int, 5>::shared();
    int* y1 = StaticSharedArrayWithID<__COUNTER__, int, 5>::shared();
    x1[0] = 1;
    y1[0] = 0;
    printf("%i %i\n\n", x1[0], y1[0]);
    // Prints:
    // 1 0
}
Does anyone have an idea how to get rid of the __COUNTER__ macro in the user interface? It is okay if it is hidden.
Upvotes: 5
Views: 1154
Reputation: 21818
The reason for this is that __shared__ variables are static by default: every instance of the same function refers to the same variable. The original reason for this behavior is that the compiler cannot deduce when the memory can be reclaimed; making the variable static keeps it alive for as long as the kernel runs.
A side effect is that if you call the same function in two places in the program, you get the same storage. In fact, that is exactly what you expect when multiple CUDA threads call your function at the same spot, don't you?
There is no clean way to allocate shared memory dynamically. In my projects I did it through my own shared-memory manager (ugly pointer arithmetic ahead, beware!):
typedef unsigned char byte;

/*
  Simple shared memory manager.
  With any luck, if invoked with constant parameters this will not take up any registers at all.
  Must be called uniformly by the whole block which is going to use it.
  sSize - amount of preallocated memory
*/
template <size_t sSize>
class SharedMemoryManager {
private:
    byte* shArray;
    byte* head;
public:
    __device__ SharedMemoryManager() {
        __shared__ byte arr[sSize];
        shArray = arr;
        head = arr;
    }

    __device__ void reset() {
        head = shArray;
    }

    __device__ byte* getHead() { return head; }
    __device__ void setHead(byte* newHead) { head = newHead; }

    template <typename T>
    __device__ T* alloc(size_t count) {
        size_t addr = (size_t)head;
        size_t alignment = __alignof(T);           // assuming alignment is a power of 2
        addr = ((addr - 1) | (alignment - 1)) + 1; // round up to match the alignment requirement
        T* var = (T*)addr;
        head = (byte*)addr + sizeof(T) * count;
        return var;
    }

    template <typename T>
    __device__ T& alloc() {
        return *alloc<T>(1);
    }
};
You can use getHead/setHead to reclaim shared memory when you know it can be reclaimed, but only in a stack manner.
This approach should be easy to abstract over non-shared memory when CUDA is not your target.
Then you should be able to write:
__global__
void test() {
    SharedMemoryManager<1024> shMem;
    int& xValue = shMem.alloc<int>();
    int& yValue = shMem.alloc<int>();
    int* xArray = shMem.alloc<int>(5);
    int* yArray = shMem.alloc<int>(5);
    xArray[0] = 1;
    yArray[0] = 0;
    printf("%i %i\n\n", xArray[0], yArray[0]);
    __syncthreads();
    shMem.reset(); // memory reclaimed
    ...
}
Upvotes: 8