Hayk Nahapetyan

Reputation: 4550

CUDA Global Array declaration and initialization before kernel call example

I need some help with CUDA global memory. In my project I need to declare a global array so that I don't have to send the array on every kernel call.

Edit:

My application can call the kernel more than 1,000 times, and on every call I send it an array of more than 1000 * 1000 elements. I think this takes a lot of time, which is why my app runs slowly. So I need to declare a global array on the GPU, and my questions are:

  1. How do I declare a global array?

  2. How do I initialize a global array from the CPU before the kernel launch?

Thanks in advance!

Upvotes: 3

Views: 6700

Answers (2)

KiaMorot

Reputation: 1746

If I understand the question correctly (it is somewhat unclear), you want to use a global array and send it to the device on every kernel call. This is bad practice and leads to high latency, because every kernel call then requires transferring your data to the device. In my experience this practice led to a slowdown rather than a speed-up.

A better way is to use what I call the flip-flop technique. It works like this:

  1. Declare two arrays on the device, d_arr1 and d_arr2.
  2. Copy the data host -> device into one of the arrays.
  3. Pass pointers to d_arr1 and d_arr2 as the kernel's parameters.
  4. Process the data in the kernel.
  5. In subsequent kernel calls, swap the pointers you pass as parameters.

This way you avoid transferring the data on every kernel call. You transfer only at the beginning and at the end of your host loop.

int a;
for (a = 0; a < 1000; a++)
{
  if (a % 2 == 0) {
    // even iteration: call to the kernel(pointer_a, pointer_b)
  } else {
    // odd iteration: call to the kernel(pointer_b, pointer_a)
  }
}
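The flip-flop steps above can be sketched as a complete program. This is only a minimal sketch under assumptions: the kernel `step`, the helper `run_ping_pong`, and the launch configuration are hypothetical names chosen for illustration, and error checking is left out to keep it short.

```cuda
#include <cstdio>
#include <utility>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: reads from `in` and writes in[i] + 1 to `out`.
__global__ void step(const int *in, int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] + 1;
}

// Uploads n zeros once, launches the kernel `iters` times while swapping the
// two device pointers, downloads once, and returns element 0 of the result.
int run_ping_pong(int n, int iters)
{
    std::vector<int> h(n, 0);
    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc((void **)&d_in, n * sizeof(int));
    cudaMalloc((void **)&d_out, n * sizeof(int));

    // Single host -> device transfer before the loop.
    cudaMemcpy(d_in, h.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    for (int a = 0; a < iters; ++a) {
        step<<<blocks, threads>>>(d_in, d_out, n);
        std::swap(d_in, d_out);  // the output becomes the next input
    }

    // After the final swap, d_in points at the most recent result.
    cudaMemcpy(h.data(), d_in, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h[0];
}

int main()
{
    printf("first element after 1000 launches: %d\n",
           run_ping_pong(1000 * 1000, 1000));
    return 0;
}
```

Swapping the pointers with `std::swap` has the same effect as alternating the argument order on even and odd iterations; only two pointer values change on the host, and no array data moves between launches.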

Upvotes: 0

Robert Crovella

Reputation: 151809

Your edited question is confusing because you say you are sending your kernel an array of size 1000 x 1000 but you want to know how to do this using a global array. The only way I know of to send this much data to a kernel is to use a global array, so you are probably already doing this with an array in global memory.

Nevertheless, there are at least two methods to create and initialize an array in global memory:

1. Statically, using __device__ and cudaMemcpyToSymbol, for example:

 #define SIZE 100
 __device__ int A[SIZE];
 ...
 int main(){
   int myA[SIZE];
   for (int i=0; i< SIZE; i++) myA[i] = 5;
   cudaMemcpyToSymbol(A, myA, SIZE*sizeof(int));
   ...
   (kernel calls, etc.)
 }

(device variable reference, cudaMemcpyToSymbol reference)
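To show how a kernel then uses the statically declared array, here is a minimal sketch that fills in the elided parts with assumptions: the kernel `add_one` and the helper `run_static` are hypothetical, and error checking is again omitted. The array is copied to the device once and stays there across launches.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define SIZE 100
__device__ int A[SIZE];  // lives in global memory for the program's lifetime

// Hypothetical kernel: accesses A directly, no pointer parameter needed.
__global__ void add_one()
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < SIZE) A[i] += 1;
}

// Initializes A to 5 from the host once, launches the kernel `launches`
// times, and returns A[0] read back from the device.
int run_static(int launches)
{
    int myA[SIZE];
    for (int i = 0; i < SIZE; i++) myA[i] = 5;
    cudaMemcpyToSymbol(A, myA, SIZE * sizeof(int));    // one upload

    for (int k = 0; k < launches; k++)
        add_one<<<1, 128>>>();                         // no data transfer here

    cudaMemcpyFromSymbol(myA, A, SIZE * sizeof(int));  // one download
    return myA[0];
}

int main()
{
    printf("A[0] after 10 launches: %d\n", run_static(10));
    return 0;
}
```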

2. Dynamically, using cudaMalloc and cudaMemcpy:

 #define SIZE 100
 ...
 int main(){
   int myA[SIZE];
   int *A;
   for (int i=0; i< SIZE; i++) myA[i] = 5;
   cudaMalloc((void **)&A, SIZE*sizeof(int));
   cudaMemcpy(A, myA, SIZE*sizeof(int), cudaMemcpyHostToDevice);
   ...
   (kernel calls, etc.)
 }

(cudaMalloc reference, cudaMemcpy reference)
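Applied to the scenario in the question (a 1000 x 1000 array and about 1,000 kernel calls), the dynamic method might look like the sketch below. The kernel `increment` and the helper `run_dynamic` are hypothetical, and error checking is omitted. The point it illustrates is that each launch passes only the device pointer; the array itself is copied once before the loop and once after it.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: receives a pointer to data already in global memory.
__global__ void increment(int *a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] += 1;
}

// Uploads n elements (all 5) once, launches the kernel `launches` times with
// the same device pointer, downloads once, and returns element 0.
int run_dynamic(int n, int launches)
{
    std::vector<int> myA(n, 5);  // heap storage; 1000 * 1000 ints is large for the stack
    int *A = nullptr;
    cudaMalloc((void **)&A, n * sizeof(int));
    cudaMemcpy(A, myA.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    for (int k = 0; k < launches; k++)
        increment<<<blocks, threads>>>(A, n);  // only the pointer value is passed

    cudaMemcpy(myA.data(), A, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(A);
    return myA[0];
}

int main()
{
    printf("A[0] after 1000 launches: %d\n", run_dynamic(1000 * 1000, 1000));
    return 0;
}
```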

For clarity I'm omitting error checking, which you should do on all CUDA API calls and kernel calls.
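One common way to add that checking is a small wrapper macro. The sketch below is an assumption about how you might structure it, not part of the CUDA API: the macro name `CUDA_CHECK`, the kernel `fill`, and the helper `run_checked` are all hypothetical. Kernel launches do not return an error code, so the launch is followed by cudaGetLastError and cudaDeviceSynchronize.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro: aborts with a message if a CUDA call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",              \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical kernel: sets every element to `value`.
__global__ void fill(int *a, int n, int value)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] = value;
}

// Allocates, fills on the device, copies back, and returns element 0,
// checking every CUDA call along the way.
int run_checked()
{
    const int n = 100;
    int h[n];
    int *d = nullptr;

    CUDA_CHECK(cudaMalloc((void **)&d, n * sizeof(int)));
    fill<<<1, 128>>>(d, n, 5);
    CUDA_CHECK(cudaGetLastError());       // reports launch errors
    CUDA_CHECK(cudaDeviceSynchronize());  // reports errors raised while the kernel ran
    CUDA_CHECK(cudaMemcpy(h, d, n * sizeof(int), cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(d));
    return h[0];
}

int main()
{
    printf("h[0] = %d\n", run_checked());
    return 0;
}
```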

Upvotes: 6
