Zheng

Reputation: 5

copy global memory by CUDA threads

I need to copy one array in global memory to another array in global memory by CUDA threads (not from the host).

My code is as follows:

__global__ void copy_kernel(int *g_data1, int *g_data2, int n)
{
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  int start, end;
  start = some_func(idx);
  end = another_func(idx);
  unsigned int i;
  for (i = start; i < end; i++) {
      g_data2[i] = g_data1[idx];
  }
}

This is very inefficient: for some idx the [start, end) range is very large, so that single thread ends up issuing far too many copy operations. Is there a way to implement this efficiently?

Thank you,

Zheng

Upvotes: 0

Views: 1207

Answers (2)

Pavan Yalamanchili

Reputation: 12109

The way you wrote it, I am guessing each thread is trying to write its whole 'start' to 'end' chunk by itself, which is really inefficient.

You need to do something like this, so that all threads of a block share the work on each chunk:

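// Inside the kernel, after each thread has computed idx, start and end.
// BLOCK_SIZE must equal blockDim.x.
__shared__ unsigned sm_start[BLOCK_SIZE];
__shared__ unsigned sm_end[BLOCK_SIZE];
sm_start[threadIdx.x] = start;
sm_end[threadIdx.x] = end;
__syncthreads();
// The whole block works on the range of each of its threads in turn.
for (unsigned t = 0; t < blockDim.x; t++) {
  int value = g_data1[blockIdx.x * blockDim.x + t];  // source element of thread t
  int *dst = g_data2 + sm_start[t];                  // do not modify g_data2 itself
  unsigned lim = sm_end[t] - sm_start[t];
  for (unsigned i = threadIdx.x; i < lim; i += blockDim.x) {
      dst[i] = value;
  }
}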
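
For reference, here is how the fragment above might look as a complete kernel with a small host wrapper. This is only a sketch under some assumptions: the arrays starts and ends stand in for some_func(idx) and another_func(idx), which the question does not define; BLOCK_SIZE, fill_ranges_kernel and fill_ranges are names made up for this example; and CUDA error checking is left out to keep it short.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

#define BLOCK_SIZE 128  // assumed block size; must match the launch configuration

// starts[idx] and ends[idx] are stand-ins for some_func(idx) and another_func(idx).
__global__ void fill_ranges_kernel(const int *g_data1, int *g_data2,
                                   const unsigned *starts, const unsigned *ends,
                                   int n)
{
    __shared__ unsigned sm_start[BLOCK_SIZE];
    __shared__ unsigned sm_end[BLOCK_SIZE];
    __shared__ int sm_value[BLOCK_SIZE];

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    bool valid = idx < n;
    // Threads past the end of the input publish an empty range, so the loop
    // below skips them. No thread returns early, so every thread of the
    // block reaches __syncthreads().
    sm_start[threadIdx.x] = valid ? starts[idx] : 0u;
    sm_end[threadIdx.x]   = valid ? ends[idx] : 0u;
    sm_value[threadIdx.x] = valid ? g_data1[idx] : 0;
    __syncthreads();

    // The block handles one range at a time; consecutive threads write
    // consecutive addresses within that range.
    for (unsigned t = 0; t < blockDim.x; t++) {
        int *dst = g_data2 + sm_start[t];
        unsigned lim = sm_end[t] - sm_start[t];
        for (unsigned i = threadIdx.x; i < lim; i += blockDim.x) {
            dst[i] = sm_value[t];
        }
    }
}

// Hypothetical host wrapper: uploads the inputs, launches the kernel and
// returns the filled output array. Assumes src is not empty.
std::vector<int> fill_ranges(const std::vector<int> &src,
                             const std::vector<unsigned> &starts,
                             const std::vector<unsigned> &ends,
                             size_t out_size)
{
    int n = static_cast<int>(src.size());
    int *d_src = nullptr, *d_dst = nullptr;
    unsigned *d_starts = nullptr, *d_ends = nullptr;
    cudaMalloc(&d_src, n * sizeof(int));
    cudaMalloc(&d_dst, out_size * sizeof(int));
    cudaMalloc(&d_starts, n * sizeof(unsigned));
    cudaMalloc(&d_ends, n * sizeof(unsigned));
    cudaMemcpy(d_src, src.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_starts, starts.data(), n * sizeof(unsigned), cudaMemcpyHostToDevice);
    cudaMemcpy(d_ends, ends.data(), n * sizeof(unsigned), cudaMemcpyHostToDevice);
    cudaMemset(d_dst, 0, out_size * sizeof(int));

    int blocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
    fill_ranges_kernel<<<blocks, BLOCK_SIZE>>>(d_src, d_dst, d_starts, d_ends, n);

    std::vector<int> out(out_size);
    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(out.data(), d_dst, out_size * sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(d_src);
    cudaFree(d_dst);
    cudaFree(d_starts);
    cudaFree(d_ends);
    return out;
}
```

With this layout a very long range is spread over all threads of the block instead of being written by a single thread. The value for each range is also staged in shared memory here, so it is read from global memory only once per block; that is a small variation on the fragment above, not something the original code does.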

Upvotes: 1

Chris Hasiński

Reputation: 3085

Try using this:

CUresult cuMemcpyDtoD(
    CUdeviceptr dstDevice,
    CUdeviceptr srcDevice,
    size_t ByteCount
)

UPDATE:

You're right: http://forums.nvidia.com/index.php?showtopic=88745

There is no efficient way to do this properly, because CUDA is designed around kernels that use only a small amount of data.

Upvotes: 0
