Reputation: 4663
In the following code:
__managed__ int mData[1024];

void foo(int* dataOut)
{
    some_kernel_that_writes_to_mdata<<<...>>>();
    // cudaDeviceSynchronize() // do I need this synch here?
    memcpy(dataOut, mData, sizeof(int) * 1024);
    ...
    cudaDeviceSynchronize();
}
Do I need synchronization between the kernel and memcpy?
The cudaMemcpy documentation mentions that the function exhibits synchronous behavior for most use cases. But what about "normal" memcpy from/to managed memory? In my tests it seems the synchronization happens implicitly, but I can't find that in the documentation.
Upvotes: 2
Views: 234
Reputation: 152143
Yes, you need that synchronization.
The kernel launch is asynchronous. Therefore, after launching the kernel, the CPU thread continues on to the next line of code without any guarantee that the kernel has completed.
If your subsequent copy operation is expecting to pick up data modified by the kernel, it's necessary to force the kernel to complete first.
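As a minimal sketch of that ordering (not the asker's actual code): the kernel name fill_mdata, the extra value parameter, and the launch configuration are assumptions made for illustration, since the original kernel and its <<<...>>> arguments are not shown.

```cuda
#include <cassert>
#include <cstring>
#include <cuda_runtime.h>

constexpr int kCount = 1024;

// Managed array, as in the question.
__managed__ int mData[kCount];

// Hypothetical stand-in for some_kernel_that_writes_to_mdata.
__global__ void fill_mdata(int value)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < kCount) {
        mData[i] = value + i;
    }
}

// Launches the kernel, waits for it, then copies with plain memcpy.
// Returns the first CUDA error encountered, or cudaSuccess.
cudaError_t foo(int* dataOut, int value)
{
    assert(dataOut != nullptr);

    // Assumed launch configuration: 4 blocks of 256 threads cover 1024 ints.
    fill_mdata<<<(kCount + 255) / 256, 256>>>(value);

    // Catch launch errors (for example, an invalid configuration).
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) return err;

    // Block the CPU thread until the kernel has finished writing mData.
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess) return err;

    // The host reads managed memory only after the kernel is done.
    std::memcpy(dataOut, mData, sizeof(int) * kCount);
    return cudaSuccess;
}
```

Calling foo(out, 7) with a host array out of 1024 ints should leave out[i] equal to 7 + i. Without the cudaDeviceSynchronize() call, the memcpy could run while the kernel is still writing.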
cudaMemcpy is a special case. It is issued into the default stream. It has both a device synchronizing characteristic (it forces all previously issued work to that device to complete before it begins the copy) and a CPU thread blocking characteristic (it does not return from the library call, i.e. allow the CPU thread to proceed, until the copy operation is complete).
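For contrast, here is a hedged sketch of the cudaMemcpy case. It uses an ordinary cudaMalloc device buffer rather than managed memory, and the names fill_buffer and copy_via_cudaMemcpy are hypothetical. It assumes the kernel and the copy are both issued to the default stream, as described above.

```cuda
#include <cassert>
#include <cuda_runtime.h>

constexpr int kLen = 1024;

// Hypothetical kernel writing to an ordinary device buffer.
__global__ void fill_buffer(int* buf, int value)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < kLen) {
        buf[i] = value + i;
    }
}

// Launches the kernel and copies the result back with cudaMemcpy,
// with no explicit cudaDeviceSynchronize() in between.
cudaError_t copy_via_cudaMemcpy(int* hostOut, int value)
{
    assert(hostOut != nullptr);

    int* dBuf = nullptr;
    cudaError_t err = cudaMalloc(&dBuf, sizeof(int) * kLen);
    if (err != cudaSuccess) return err;

    fill_buffer<<<(kLen + 255) / 256, 256>>>(dBuf, value);
    err = cudaGetLastError();

    if (err == cudaSuccess) {
        // cudaMemcpy is ordered after the kernel in the default stream and
        // does not return until the device-to-host copy has completed.
        err = cudaMemcpy(hostOut, dBuf, sizeof(int) * kLen,
                         cudaMemcpyDeviceToHost);
    }

    cudaFree(dBuf);
    return err;
}
```

The design difference from plain memcpy is that the CUDA runtime knows about this copy and can order it after the kernel, whereas memcpy is an ordinary host function that the runtime never sees.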
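The explicit synchronization before memcpy would also be required in a pre-Pascal unified memory (UM) regime, where the CPU must not access managed memory while a kernel is running; doing so results in a seg fault. The fact that you are not getting a seg fault suggests to me that you are in a demand-paged UM regime.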
Upvotes: 5