copy global memory by CUDA threads

Question

I need to copy one array in global memory to another array in global memory by CUDA threads (not from the host).

My code is as follows:

__global__ void copy_kernel(int *g_data1, int *g_data2, int n)
{
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  // some_func and another_func (not shown) give this thread's output range
  int start = some_func(idx);
  int end = another_func(idx);
  // each thread writes its single source value across its whole range
  for (int i = start; i < end; i++) {
      g_data2[i] = g_data1[idx];
  }
}

It is very inefficient because, for some idx, the [start, end) range is very large, so that one thread issues too many copy operations. Is there any way to implement this efficiently?

Thank you,

Zheng

Chris Hasiński · Accepted Answer

Try the driver API's device-to-device copy, which is called from the host:

CUresult cuMemcpyDtoD(
    CUdeviceptr dst,
    CUdeviceptr src,
    size_t bytes
)
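For illustration, here is a minimal sketch of a host-side device-to-device copy. It uses the runtime API counterpart, cudaMemcpy with cudaMemcpyDeviceToDevice, rather than cuMemcpyDtoD itself, so that no driver context setup is needed. The helper names copy_on_device and round_trip are made up for this example, and the call is issued from the host, not from inside a kernel.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical helper: copies `count` ints from one device buffer to another.
// It is called from the host; the data never leaves device memory.
bool copy_on_device(int *d_dst, const int *d_src, size_t count)
{
    cudaError_t err = cudaMemcpy(d_dst, d_src, count * sizeof(int),
                                 cudaMemcpyDeviceToDevice);
    return err == cudaSuccess;
}

// Hypothetical round trip to exercise the helper:
// host -> device (src) -> device (dst) -> host.
// Returns a zero-filled vector if any step fails.
std::vector<int> round_trip(const std::vector<int> &input)
{
    size_t bytes = input.size() * sizeof(int);
    std::vector<int> output(input.size(), 0);
    int *d_src = nullptr;
    int *d_dst = nullptr;

    if (cudaMalloc(&d_src, bytes) != cudaSuccess) return output;
    if (cudaMalloc(&d_dst, bytes) != cudaSuccess) {
        cudaFree(d_src);
        return output;
    }

    if (cudaMemcpy(d_src, input.data(), bytes,
                   cudaMemcpyHostToDevice) == cudaSuccess &&
        copy_on_device(d_dst, d_src, input.size())) {
        cudaMemcpy(output.data(), d_dst, bytes, cudaMemcpyDeviceToHost);
    }

    cudaFree(d_src);
    cudaFree(d_dst);
    return output;
}
```

This only helps when the copy can be issued between kernel launches; it does not cover the case where each thread decides its own range inside the kernel.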

UPDATE:

You're right: http://forums.nvidia.com/index.php?showtopic=88745

There is no efficient way to do this properly, because CUDA is designed for kernels in which each thread works on only a small amount of data.
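If the copy has to stay inside a kernel, one possible way to soften the imbalance is to give each source element to a whole block instead of a single thread, so that a large range is split across many threads. The sketch below assumes the ranges are available as two arrays, starts and ends, standing in for the question's some_func and another_func; those arrays, the kernel name fill_ranges, and the host helper run_fill are all hypothetical. It does not reduce the total number of writes, it only spreads them out.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Assumption: element idx owns the output range [starts[idx], ends[idx]).
// These arrays replace some_func / another_func from the question.
__global__ void fill_ranges(const int *g_data1, int *g_data2,
                            const int *starts, const int *ends, int n)
{
    // One block per source element (grid-stride in case n > gridDim.x).
    for (int idx = blockIdx.x; idx < n; idx += gridDim.x) {
        int value = g_data1[idx];
        // Threads of the block stride through the range, so consecutive
        // threads write consecutive addresses.
        for (int i = starts[idx] + threadIdx.x; i < ends[idx]; i += blockDim.x)
            g_data2[i] = value;
    }
}

// Hypothetical host wrapper: uploads the inputs, runs the kernel, and
// returns the output array (zero where no range covers an index).
std::vector<int> run_fill(const std::vector<int> &src,
                          const std::vector<int> &starts,
                          const std::vector<int> &ends, int out_len)
{
    int n = static_cast<int>(src.size());
    std::vector<int> out(out_len, 0);
    if (n == 0 || out_len == 0) return out;

    int *d_src = nullptr, *d_out = nullptr, *d_starts = nullptr, *d_ends = nullptr;
    cudaMalloc(&d_src, n * sizeof(int));
    cudaMalloc(&d_starts, n * sizeof(int));
    cudaMalloc(&d_ends, n * sizeof(int));
    cudaMalloc(&d_out, out_len * sizeof(int));

    cudaMemcpy(d_src, src.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_starts, starts.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_ends, ends.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, out_len * sizeof(int));

    // 128 threads per block is an arbitrary choice for this sketch.
    fill_ranges<<<n, 128>>>(d_src, d_out, d_starts, d_ends, n);
    cudaMemcpy(out.data(), d_out, out_len * sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(d_src);
    cudaFree(d_starts);
    cudaFree(d_ends);
    cudaFree(d_out);
    return out;
}
```

Whether this is actually faster depends on how the range sizes are distributed: with many tiny ranges, most threads in each block would sit idle.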
