Reputation: 2584
I'm trying to copy 2 arrays from global memory to shared memory. The arrays are of type double and have 32 elements each. NumberThreadPerBlock is 32.
Code:
__global__ void kernel_0(double px[], double py[], int N)
{
    int ii, jj, tid;
    tid = blockIdx.x*blockDim.x + threadIdx.x;
    __shared__ double s_px[256];
    __shared__ double s_py[256];
    __shared__ double s[256];
    s_px[threadIdx.x] = px[tid];
    s_py[threadIdx.x] = py[tid];
    s[threadIdx.x] = py[tid];
    __syncthreads();
}
int main (int argc, char *argv[]){
    double *px, *py, *x, *y, PI, step, *d_x, *d_y, *d_px, *d_py, sharedMemSize;
    int N, Nx, ii;
    PI = 4*atan(1.0);
    Nx = 10000;
    N = 32;
    px = (double *) malloc(N*sizeof(double));
    py = (double *) malloc(N*sizeof(double));
    // lookup table: sin
    // from 0 to PI
    step = 1.0 / (N-1);
    for (ii = 0; ii < N; ii++){
        px[ii] = ii*step*PI;
        py[ii] = sin(px[ii]);
    }
    cudaMalloc( (void **) &d_px, N*sizeof(double) );
    cudaMalloc( (void **) &d_py, N*sizeof(double) );
    cudaMemcpy( d_px, px, N*sizeof(double), cudaMemcpyHostToDevice );
    cudaMemcpy( d_py, py, N*sizeof(double), cudaMemcpyHostToDevice );
    dim3 dimGrid(Nx);
    dim3 dimBlock(N,1,1);
    kernel_0<<< dimGrid, dimBlock>>>(px, py, N);
}
It compiles, but cuda-memcheck shows me many errors:
========= Invalid __global__ read of size 8
========= at 0x00000058 in kernel_0
========= by thread (31,0,0) in block (6,0,0)
========= Address 0x11e0db38 is out of bounds
=========
========= ERROR SUMMARY: 96 errors
Can you help me?
Upvotes: 0
Views: 2144
Reputation: 2179
From what I can see, the memory allocated for the device pointers (px, py) is 32*sizeof(double) bytes, but you launch 10000 blocks.
Device (global) memory is shared by all blocks; only shared memory is allocated separately for each block.
Therefore, for blockIdx.x >= 1, tid = blockIdx.x*blockDim.x + threadIdx.x is 32 or greater and indexes past the end of the arrays, so you should get an invalid memory access.
Moreover, the kernel launch should pass the device pointers d_px and d_py, not the host pointers px and py.
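To illustrate both points, here is a minimal sketch of one possible fix. It assumes the intent is for every block to load the same 32-entry lookup table into its own shared memory, so the table is indexed with threadIdx.x rather than the global thread id. The TABLE_MAX constant, the bounds guard, and the error check after the launch are additions for illustration and are not part of the original code.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cmath>
#include <cuda_runtime.h>

// Illustrative upper bound on the table size (matches the 256 used in the question).
#define TABLE_MAX 256

// Assumption: every block needs the same N-entry table, with N <= TABLE_MAX.
__global__ void kernel_0(const double *px, const double *py, int N)
{
    __shared__ double s_px[TABLE_MAX];
    __shared__ double s_py[TABLE_MAX];

    // Index with threadIdx.x, not blockIdx.x*blockDim.x + threadIdx.x,
    // so every block reads only the N elements that were allocated.
    if (threadIdx.x < N) {
        s_px[threadIdx.x] = px[threadIdx.x];
        s_py[threadIdx.x] = py[threadIdx.x];
    }
    __syncthreads();

    // ... per-block work using s_px / s_py would go here ...
}

int main()
{
    const int N  = 32;      // table size and threads per block
    const int Nx = 10000;   // number of blocks
    const double PI   = 4.0 * atan(1.0);
    const double step = 1.0 / (N - 1);

    double *px = (double *) malloc(N * sizeof(double));
    double *py = (double *) malloc(N * sizeof(double));

    // lookup table: sin from 0 to PI
    for (int ii = 0; ii < N; ii++) {
        px[ii] = ii * step * PI;
        py[ii] = sin(px[ii]);
    }

    double *d_px = NULL, *d_py = NULL;
    cudaMalloc((void **) &d_px, N * sizeof(double));
    cudaMalloc((void **) &d_py, N * sizeof(double));
    cudaMemcpy(d_px, px, N * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(d_py, py, N * sizeof(double), cudaMemcpyHostToDevice);

    // Pass the device pointers, not the host pointers.
    kernel_0<<<Nx, N>>>(d_px, d_py, N);

    // Report launch errors first, then errors raised while the kernel ran.
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    printf("kernel status: %s\n", cudaGetErrorString(err));

    cudaFree(d_px);
    cudaFree(d_py);
    free(px);
    free(py);
    return err == cudaSuccess ? 0 : 1;
}
```

If each block is instead meant to work on its own 32-element slice, the original tid indexing can stay, but then the device arrays would need Nx*N elements rather than N.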
Upvotes: 1