Harald

Reputation: 546

CUDA: Copy host data to shared memory array

I have a struct defined for both host and device. On the host I initialize an array of this struct with values.

MyStruct *h_s = (MyStruct *) malloc(objsize * sizeof(MyStruct));
h_s[0] = ...

MyStruct *d_s;
cudaMalloc( &d_s, objsize * sizeof(MyStruct) );
cudaMemcpy( d_s, h_s, objsize * sizeof(MyStruct), cudaMemcpyHostToDevice );
init<<< gridSize, blockSize >>>( d_s );

In my kernel I have about 7 functions which should use this array. Some of them are __global__ and some are plain __device__ functions. For simplicity and efficiency I want to use a shared memory array.

__shared__ MyStruct *d_s;

__global__ void init(MyStruct *theStructArray){
   // How do I allocate memory for d_s?
   // How do I copy theStructArray to d_s?
}

So the question is: How can I allocate memory for the shared array and set its values with the function parameter?

Edit: I am trying to port the smallpt code to CUDA.

struct Sphere {
    double rad;       // radius
    Vec p, e, c;      // position, emission, color
    Refl_t refl;      // reflection type (DIFFuse, SPECular, REFRactive)

    Sphere(){
        rad = 16.5;
        p = (Vec(27,16.5,47) + Vec(73,16.5,78))*0.5;
        e = Vec();
        c = Vec(0.75, 0.75, 0.75);
        refl = DIFF;
    }

    Sphere(double rad_, Vec p_, Vec e_, Vec c_, Refl_t refl_):
        rad(rad_), p(p_), e(e_), c(c_), refl(refl_) {}

    __device__ double intersect(const Ray &r) const { // returns distance, 0 if nohit
        Vec op = p-r.o; // Solve t^2*d.d + 2*t*(o-p).d + (o-p).(o-p)-R^2 = 0
        double t, eps=1e-4, b=op.dot(r.d), det=b*b-op.dot(op)+rad*rad;
        if (det<0) return 0; else det=sqrt(det);
        return (t=b-det)>eps ? t : ((t=b+det)>eps ? t : 0);
    } 
};
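Note that only intersect is marked __device__. I assume (this may be part of my problem) that to construct or copy Sphere objects in device code the constructors would also need qualifiers, along these lines:

__host__ __device__ Sphere(){
    rad = 16.5;
    p = (Vec(27,16.5,47) + Vec(73,16.5,78))*0.5;
    e = Vec();
    c = Vec(0.75, 0.75, 0.75);
    refl = DIFF;
}

__host__ __device__ Sphere(double rad_, Vec p_, Vec e_, Vec c_, Refl_t refl_):
    rad(rad_), p(p_), e(e_), c(c_), refl(refl_) {}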

Upvotes: 2

Views: 6872

Answers (1)

talonmies

Reputation: 72348

If you understand the scope and size limitations of shared memory, then the question reduces to:

  1. how to dynamically reserve memory for the shared memory array
  2. how to use that dynamic shared memory within the kernel

Your kernel becomes something like this:

__global__ void init(MyStruct *theStructArray){

    // dynamically sized, block-scoped shared array; its size in bytes
    // is supplied as the third argument of the kernel launch
    extern __shared__ MyStruct d_s[];

    int tid = blockDim.x * blockIdx.x + threadIdx.x;

    // load to shared memory array
    // assumes MyStruct has correct copy assignment semantics
    d_s[threadIdx.x] = theStructArray[tid];

    __syncthreads();

    // Each thread has now loaded one value to the block
    // scoped shared array
}

[disclaimer: code written in browser, never compiled or tested, and note the caveat in comments about copy assignment]
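Once the data is staged, any __device__ function called from the same kernel can take a pointer to the block's shared copy. As a further illustrative sketch (the helper name and the use of your Sphere/Ray types are my own, equally untested):

// Illustrative __device__ consumer of the shared array: returns the
// distance to the closest intersection found among the first n spheres.
__device__ double nearestHit(const Sphere *spheres, int n, const Ray &r)
{
    double tmin = 1e20;                        // sentinel "no hit" distance
    for (int i = 0; i < n; ++i) {              // scan the block's shared copy
        double t = spheres[i].intersect(r);
        if (t > 0.0 && t < tmin) tmin = t;
    }
    return tmin;
}

A kernel thread would call it as nearestHit(d_s, blockDim.x, someRay), after the __syncthreads() barrier has guaranteed the whole array is loaded.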

The calling host code needs to add an additional argument to the kernel call to reserve memory for the shared array:

MyStruct *h_s = (MyStruct *) malloc(objsize * sizeof(MyStruct));
h_s[0] = ...

MyStruct *d_s;
cudaMalloc( &d_s, objsize * sizeof(MyStruct) );
cudaMemcpy( d_s, h_s, objsize * sizeof(MyStruct), cudaMemcpyHostToDevice );
init<<< gridSize, blockSize, blockSize * sizeof(MyStruct) >>>( d_s );

Note the third argument in the <<< >>> stanza of the kernel call. It specifies the number of bytes of shared memory reserved per block. There are hardware-dictated limits on the size of the shared memory allocations you can make, and large allocations can also hurt performance (by limiting occupancy) well before you reach those limits.
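If you would rather check the limit at runtime than hard-code it, the device properties expose it (same caveat: written in browser, untested):

cudaDeviceProp prop;
cudaGetDeviceProperties( &prop, 0 );              // properties of device 0
size_t maxBytes   = prop.sharedMemPerBlock;       // shared memory per block, in bytes
size_t maxStructs = maxBytes / sizeof(MyStruct);  // most structs one block can stage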

Shared memory is a very well documented feature of CUDA; I would recommend Mark Harris's blog and this Stack Overflow question as good starting points on the mechanics of shared memory in CUDA.

Upvotes: 2
