Reputation: 3438
In the piece of code here I came across a struct used for shared memory definition and usage. I changed the allocation to be static and used it in a test program like the one below:
#include <stdio.h>

template <class T, uint bDim>
struct SharedMemory
{
    __device__ inline operator T *() {
        __shared__ T __smem[ bDim ];
        return (T *) (void *) __smem;
    }
    __device__ inline operator const T *() const {
        __shared__ T __smem[ bDim ];
        return (T *) (void *) __smem;
    }
};

template <uint bDim>
__global__ void myKernel() {
    SharedMemory<uint, bDim> myShared;
    myShared[ threadIdx.x ] = threadIdx.x;
    __syncthreads();
    printf("%d\tsees\t%d\tat two on the circular right.\n",
           threadIdx.x, myShared[ ( threadIdx.x + 2 ) & 31 ]);
}

int main() {
    myKernel<32><<<1, 32>>>();
    cudaDeviceSynchronize();
    return 0;
}
It works as expected. However, I have a few questions about this usage:

I don't understand the syntax used in the operator overloading section of the SharedMemory struct. Is it overloading the dereference operator *? If so, how do accesses via square brackets translate into pointer dereferences? Also, why does changing the __device__ inline operator T *() { line into __device__ inline T operator *() { produce compiler errors?
I wanted to ease the use of the wrapper by overloading the assignment operator or defining a member function, so that each thread updates the shared memory location corresponding to its thread index. For example, writing myShared = 47; or myShared.set( 47 ); would translate into myShared[ threadIdx.x ] = 47; behind the curtain. But I have been unsuccessful at doing this. It compiles fine, but the shared memory buffer reads back all 0s (which I think is the default shared memory initialization in Debug mode). Can you please let me know where I'm going wrong? Here's my try:
template <class T, uint bDim>
struct SharedMemory
{
    __device__ inline operator T *() {
        __shared__ T __smem[ bDim ];
        return (T *) (void *) __smem;
    }
    __device__ inline operator const T *() const {
        __shared__ T __smem[ bDim ];
        return (T *) (void *) __smem;
    }
    __device__ inline T& operator=( const T& __in ) {
        __shared__ T __smem[ bDim ];
        __smem[ threadIdx.x ] = __in;
        return (T&) __smem[ threadIdx.x ];
    }
    __device__ inline void set( const T __in ) {
        __shared__ T __smem[ bDim ];
        __smem[ threadIdx.x ] = __in;
    }
};
For the member function, the compiler gives a warning:

variable "__smem" was set but never used

Although I am aware that member variables cannot be __shared__, I think I have a wrong assumption about how __shared__ works, or that what I want to do doesn't match the characteristics of the __shared__ qualifier. I'd appreciate the help.
Upvotes: 3
Views: 1198
Reputation: 72342
It appears you had a few misunderstandings about what the __shared__ memory space specifier actually does in CUDA. That, combined with a rather tricky template designed to fool the compiler for the case where extern __shared__ memory is used in templated kernel instances, led you down a blind path.
If I have understood your need correctly, what you really are looking for is something like this:
template <typename T>
struct wrapper
{
    T * p;
    unsigned int tid;

    __device__ wrapper(T * _p, unsigned int _tid) : p(_p), tid(_tid) {}

    __device__ const T* operator->() const { return p + tid; }
    __device__ T& operator*() { return *(p + tid); }
    __device__ const T& operator*() const { return *(p + tid); }
};
This is a wrapper you can use to "hide" a pointer and an offset, giving index-free access to the pointer, for example:
#include <cstdio>

// structure definition goes here

__global__ void kernel(float *in)
{
    __shared__ float _buff[32];
    wrapper<float> buff(&_buff[0], threadIdx.x);

    *buff = in[threadIdx.x + blockIdx.x * blockDim.x];
    __syncthreads();

    for(int i=0; (i<32) && (threadIdx.x == 0); ++i) {
        printf("%d %d %f\n", blockIdx.x, i, _buff[i]);
    }
}
int main()
{
    float * d = new float[128];
    for(int i=0; i<128; i++) { d[i] = 1.5f + float(i); }

    float * _d;
    cudaMalloc((void **)&_d, sizeof(float) * size_t(128));
    cudaMemcpy(_d, d, sizeof(float) * size_t(128), cudaMemcpyHostToDevice);

    kernel<<<4, 32>>>(_d);
    cudaDeviceSynchronize();

    cudaFree(_d);
    delete [] d;
    cudaDeviceReset();
    return 0;
}
In the example kernel, the shared memory array _buff is wrapped, together with the thread index, in a wrapper instance, and the operator overloads let each thread access its own array element without an explicit indexing operation every time. Perhaps you can modify this to suit your needs.
Upvotes: 4