KansaiRobot

Reputation: 9912

Is it necessary to use synchronization between two calls to CUDA kernels?

So far I have written programs where a kernel is called only once.

So I have a kernel

__global__ void someKernel(float *d_in) {  //Any parameters
    //some operation
}

and I basically do

main()
{
   //create an array in device memory
   cudaMalloc(......);
   //move host data to that array
   cudaMemcpy(......,cudaMemcpyHostToDevice);
   //call the kernel
   someKernel<<<nblocks, 512>>>(.......);
   //copy results to host memory
   cudaMemcpy(......,cudaMemcpyDeviceToHost);

//  Point to notice HERE
}

It works fine. However, this time I want to call the kernel not just once but many times, something like this:

main()
{
   //create an array in device memory
   cudaMalloc(......);
   //move host data to that array
   cudaMemcpy(......,cudaMemcpyHostToDevice);
   //call the kernel
   someKernel<<<nblocks, 512>>>(.......);
   //copy results to host memory
   cudaMemcpy(......,cudaMemcpyDeviceToHost);


// From here
//Some unrelated calculations here
 dothis();
 dothat();
//Then call the kernel again repeatedly
 for (auto k : some_ks)
   {
     // Do some pre-calculations

     //call the kernel
     someKernel<<<nblocks, 512>>>(.......);
      
    // some post calculations  

   }
}

My question is: should I use some kind of synchronization between the first kernel call and the kernel calls in the for loop (and between each iteration)? Perhaps cudaDeviceSynchronize, or something else? Or is it not necessary?

Upvotes: 1

Views: 1311

Answers (1)

Robert Crovella

Reputation: 151799

Additional synchronization is not necessary in this case, for at least two reasons:

  1. cudaMemcpy is a synchronizing call already. It blocks the CPU thread and waits until all previous CUDA activity issued to that device is complete, before it allows the data transfer to begin. Once the data transfer is complete, the CPU thread is allowed to proceed.

  2. CUDA activity issued to a single device will not overlap in any way unless you use CUDA streams, and you are not using streams. Therefore even asynchronous work issued to the device executes in issue order: if item A and item B are issued to the device in that order, item A will complete before item B is allowed to begin. This is a basic point of CUDA stream semantics. Both points are illustrated in the sketch below.
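Here is a minimal sketch of both points, following the launch pattern in the question. The kernel body (an element-wise increment), the element count, and the final check are hypothetical stand-ins for the question's placeholders; the point is that there is no explicit synchronization call anywhere.

#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the question's kernel: add 1 to every element.
__global__ void someKernel(float *d_in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_in[i] += 1.0f;
}

int main()
{
    const int n = 1 << 20;                    // hypothetical element count
    const int nblocks = (n + 511) / 512;
    float *h_data = new float[n]();           // host array, zero-initialized
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);

    someKernel<<<nblocks, 512>>>(d_data, n);  // first launch (async from the host's view)

    // Point 1: no explicit sync needed here. cudaMemcpy waits for the kernel
    // to finish before the copy begins, then blocks the host until the copy is done.
    cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);

    for (int k = 0; k < 3; ++k) {
        // Point 2: these launches all go to the default stream, so each
        // kernel starts only after the previous one has completed. No
        // cudaDeviceSynchronize is needed between iterations.
        someKernel<<<nblocks, 512>>>(d_data, n);
    }
    cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost); // implicit sync again

    printf("h_data[0] = %.1f (expect 4.0: one launch + three loop launches)\n", h_data[0]);
    cudaFree(d_data);
    delete[] h_data;
    return 0;
}

One caveat from the host's perspective: only the cudaMemcpy calls block the CPU thread. The kernel launches themselves return immediately, so the pre- and post-calculations in your loop run on the CPU while the device works through its queue in order.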

Upvotes: 9
