Does memcpy from/to unified memory exhibit synchronous behavior?

Question

In the following code:

__managed__ int mData[1024];

void foo(int* dataOut)
{
    some_kernel_that_writes_to_mdata<<<...>>>();
    // cudaDeviceSynchronize() // do I need this synch here?
    memcpy(dataOut, mData, sizeof(int) * 1024);

    ...

    cudaDeviceSynchronize();
}

Do I need synchronization between the kernel and memcpy?

The cudaMemcpy documentation mentions that the function exhibits synchronous behavior for most use cases. But what about "normal" memcpy from/to managed memory? In my tests it seems the synchronization happens implicitly, but I can't find that in the documentation.

Robert Crovella · Accepted Answer

Yes, you need that synchronization.

The kernel launch is asynchronous. Therefore the CPU thread will continue on to the next line of code, after launching the kernel, without any guarantee that the kernel has completed.

If your subsequent copy operation is expecting to pick up data modified by the kernel, it's necessary to force the kernel to complete first.
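
As a minimal sketch of that ordering (the kernel `fill_managed`, the function name `copy_after_sync`, and the launch configuration are hypothetical stand-ins, not the code from the question), the explicit synchronization sits between the launch and the host-side memcpy:

```cuda
#include <cstring>
#include <cuda_runtime.h>

constexpr int N = 1024;
__managed__ int mData[N];

// Hypothetical stand-in for the question's kernel: writes i into mData[i].
__global__ void fill_managed()
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) mData[i] = i;
}

// Launches the kernel, waits for it, then copies on the host.
// Returns true if the launch and the synchronization both succeeded.
bool copy_after_sync(int* dataOut)
{
    fill_managed<<<(N + 255) / 256, 256>>>();
    if (cudaGetLastError() != cudaSuccess) return false;

    // Without this call, the plain memcpy below could run while the
    // kernel is still writing to mData.
    if (cudaDeviceSynchronize() != cudaSuccess) return false;

    memcpy(dataOut, mData, sizeof(int) * N);
    return true;
}
```

Calling `copy_after_sync(out)` with a host array of 1024 ints should leave `out[i] == i`; removing the `cudaDeviceSynchronize()` call makes the result depend on timing.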

cudaMemcpy is a special case. It is issued into the default stream. It has both a device synchronizing characteristic (it forces all previously issued work to that device to complete before it begins the copy) and a CPU thread blocking characteristic (it does not return from the library call, i.e. allow the CPU thread to proceed, until the copy operation is complete).
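
For contrast, here is a rough sketch of the cudaMemcpy case. It assumes an ordinary cudaMalloc allocation rather than managed memory, and the names `fill_device` and `copy_with_cudaMemcpy` are made up for illustration. No explicit synchronization appears between the launch and the copy, because the copy is ordered after the kernel and blocks the CPU thread until it finishes:

```cuda
#include <cuda_runtime.h>

constexpr int kCount = 1024;

// Hypothetical kernel: writes 2*i into d[i].
__global__ void fill_device(int* d)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < kCount) d[i] = 2 * i;
}

// Returns true if allocation and copy succeeded.
bool copy_with_cudaMemcpy(int* dataOut)
{
    int* d = nullptr;
    if (cudaMalloc(&d, sizeof(int) * kCount) != cudaSuccess) return false;

    fill_device<<<(kCount + 255) / 256, 256>>>(d);

    // Issued after the kernel in the default stream: the copy starts only
    // once the kernel is done, and the call returns only once the copy is done.
    cudaError_t err = cudaMemcpy(dataOut, d, sizeof(int) * kCount,
                                 cudaMemcpyDeviceToHost);
    cudaFree(d);
    return err == cudaSuccess;
}
```

A plain memcpy has none of this behavior: it is an ordinary host library call that knows nothing about outstanding GPU work.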

(That synchronization would also be required in a pre-Pascal UM regime, where touching managed memory from the CPU while a kernel is still running causes a seg fault. The fact that you are not getting a seg fault suggests to me that you are in a demand-paged UM regime.)
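
One way to check which regime a device is in, sketched here with a hypothetical helper name, is to query the `cudaDevAttrConcurrentManagedAccess` attribute. A value of 1 indicates that the device can access managed memory concurrently with the CPU, which is what the demand-paged regime provides:

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: returns 1 if the device reports concurrent managed
// access, 0 if it does not, and -1 if the query itself fails.
int concurrent_managed_access(int device)
{
    int value = 0;
    cudaError_t err = cudaDeviceGetAttribute(
        &value, cudaDevAttrConcurrentManagedAccess, device);
    return (err == cudaSuccess) ? value : -1;
}
```

A result of 1 only means that concurrent CPU access will not fault. It does not make the kernel's results visible any earlier, so the synchronization before memcpy is still needed for correct data.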
