cuda copy to shared memory an array larger than thread count

Question

How would someone copy an array of a size x , if a kernel needs to have y threads for some other reason. All the examples I found are copying array of the size of thread count, such as:

__global__ void staticReverse(int *d, int n)
{
  __shared__ int s[64];
  int t = threadIdx.x;

  s[t] = d[t];
  __syncthreads();

How would it look If I need my s[100000] and I need to launch my kernel with just 1 block of 640 threads?
How would it look If I need my s[100000] and I need to launch my kernel with 10 blocks of 64 threads? Here I am also confused as a shared memory is just for 1 block, so I don't understand how threads from other blocks copy to shared memory of which block ?

talonmies · Accepted Answer

How would it look If I need my s[100000] and I need to launch my kernel with just 1 block of 640 threads?

It wouldn't. Currently supported CUDA hardware has a limit of either 48kb or 96kb per block, so an integer shared array with 100000 elements is illegal.

But in general, if you had a design pattern where it was necessary to load more then one data point per thread into shared memory, then you would do something like this

__global__ void staticReverse(int *d, int n)
{
  __shared__ int s[2048];
  int t = threadIdx.x;

  for(; t < 2048; t += blockDim.x) s[t] = d[t];
  __syncthreads();

Because shared memory is strictly block scope, you must use something like the above design pattern to load data to shared storage. Obviously threads from other blocks can never have access to the shared memory of anything other than their own block.

cuda copy to shared memory an array larger than thread count

Answers (1)

Related Questions