Reputation: 9912
So far I have written programs where a kernel is called only once in the program. So I have a kernel:
__global__ void someKernel(float *d_in) { // any parameters
    // some operation
}
and I basically do
int main()
{
    // create an array in device memory
    cudaMalloc(......);
    // move host data to that array
    cudaMemcpy(......, cudaMemcpyHostToDevice);
    // call the kernel
    someKernel<<<nblocks,512>>>(.......);
    // copy results to host memory
    cudaMemcpy(......, cudaMemcpyDeviceToHost);
    // Point to notice HERE
}
It works fine. However, this time I want to call the kernel not just once but many times, something like:
int main()
{
    // create an array in device memory
    cudaMalloc(......);
    // move host data to that array
    cudaMemcpy(......, cudaMemcpyHostToDevice);
    // call the kernel
    someKernel<<<nblocks,512>>>(.......);
    // copy results to host memory
    cudaMemcpy(......, cudaMemcpyDeviceToHost);
    // From here
    // some unrelated calculations here
    dothis();
    dothat();
    // then the kernel again, repeatedly
    for (auto k : some_ks)
    {
        // do some pre-calculations
        // call the kernel
        someKernel<<<nblocks,512>>>(.......);
        // some post-calculations
    }
}
My question is: do I need some kind of synchronization between the first kernel call and the kernel calls in the for loop (and between iterations of the loop)? Perhaps cudaDeviceSynchronize, or something else? Or is it not necessary?
Upvotes: 1
Views: 1311
Reputation: 151799
Additional synchronization would not be necessary in this case, for at least two reasons.
cudaMemcpy is a synchronizing call already. It blocks the CPU thread and waits until all previous CUDA activity issued to that device is complete before it allows the data transfer to begin. Once the data transfer is complete, the CPU thread is allowed to proceed.
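So in your first snippet, the device-to-host copy already gives you that guarantee (a fragment; h_out, d_in and nbytes here are hypothetical names, not your actual variables):

someKernel<<<nblocks,512>>>(d_in);
// this copy will not begin until someKernel has finished, and the CPU thread
// will not move past this call until the copy itself is complete
cudaMemcpy(h_out, d_in, nbytes, cudaMemcpyDeviceToHost);
// h_out is now safe to read on the host; no cudaDeviceSynchronize() was needed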
CUDA activity issued to a single device will not overlap in any way unless you use CUDA streams. You are not using streams, so even asynchronous work issued to the device will execute in issue order: items A and B issued to the device in that order will not overlap with each other, and item A will complete before item B is allowed to begin. This is a principal point of CUDA stream semantics.
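Putting both points together, here is a complete, compilable sketch of your pattern (the kernel body, sizes, and loop count are placeholders I made up, not your actual code). It produces the correct result with no explicit synchronization anywhere:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void someKernel(float *d_in)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    d_in[i] += 1.0f; // placeholder for the real operation
}

int main()
{
    const int nblocks = 4, n = nblocks * 512;
    float h[4 * 512] = {0};
    float *d_in;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMemcpy(d_in, h, n * sizeof(float), cudaMemcpyHostToDevice);

    someKernel<<<nblocks, 512>>>(d_in);     // the "first" call

    // ... unrelated host work would go here ...

    for (int k = 0; k < 10; ++k)            // the repeated calls
        someKernel<<<nblocks, 512>>>(d_in); // each launch waits for the previous
                                            // one: same device, default stream

    // this copy waits for all 11 kernel launches to finish, then blocks the
    // CPU thread until the data has arrived -- the only sync point needed
    cudaMemcpy(h, d_in, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("h[0] = %f (expect 11)\n", h[0]);
    cudaFree(d_in);
    return 0;
}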
Upvotes: 9