Reputation: 14660
Consider the two snippets of code.
Snippet1
cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);
cudaMemcpyAsync( dst, src, size, dir, stream1 );
kernel<<<grid, block, 0, stream2>>>(...);
Snippet2
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);
cudaMemcpy( dst, src, size, dir, stream1 );
kernel<<<grid, block, 0, stream2>>>(...);
In both snippets I am issuing a memcpy call (asynchronous in Snippet1, synchronous in Snippet2).
Since the commands have been issued to two different streams, my understanding is that there can be potential overlap in both cases.
But in Snippet2 the cudaMemcpy call is synchronous (i.e., blocking), which leads me to the paradoxical conclusion that the cudaMemcpy and the kernel launch will execute one after the other.
Which conclusion is correct?
To rephrase more compactly: when we issue a cudaMemcpy call to a stream, does it block the entire program, or only the stream it was issued to?
Upvotes: 1
Views: 4859
Reputation: 81
ArcheaSoftware is partially correct. Synchronous calls indeed do not return control to the CPU until the operation has completed; in that sense, your kernel launch will only occur after the cudaMemcpy call returns. However, depending on your buffer types, the kernel may or may not be able to use the data transferred by the cudaMemcpy call. Some examples below:
Example 1:
cudaMallocHost(&src, size);
cudaMalloc(&dst, size);
cudaMemcpy(dst, src, size, cudaMemcpyHostToDevice);
kernel<<<grid, block, 0, stream2>>>(...);
In this case, the kernel can use the data copied from src to dst.
Example 2:
src = malloc(size);
cudaMalloc(&dst, size);
cudaMemcpy(dst, src, size, cudaMemcpyHostToDevice);
kernel<<<grid, block, 0, stream2>>>(...);
In this case, cudaMemcpy can return before the data has actually been transferred to the device. cudaMemcpy from unregistered host buffers (e.g., malloc'd buffers) only guarantees that the data has been copied out of the source buffer, perhaps into an intermediate staging buffer, before the call returns. This is surprising behavior, but it is defined as such in the NVIDIA CUDA documentation. Ref: https://docs.nvidia.com/cuda/cuda-runtime-api/api-sync-behavior.html#api-sync-behavior
In general, I recommend avoiding unregistered host buffers because of this behavior.
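If you need to keep an existing heap allocation rather than allocating with cudaMallocHost, one option is to pin it after the fact with cudaHostRegister, which makes subsequent copies behave like Example 1. A minimal sketch (buffer names and sizes here are illustrative, not from the question):

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

int main() {
    const size_t size = 1 << 20;
    char *src = (char *)malloc(size);
    char *dst = NULL;

    // Pin the existing malloc'd buffer; the driver can now DMA directly
    // from it, so cudaMemcpy returns only after the device copy completes.
    cudaHostRegister(src, size, cudaHostRegisterDefault);
    cudaMalloc(&dst, size);

    cudaMemcpy(dst, src, size, cudaMemcpyHostToDevice);
    // At this point the data is guaranteed to be on the device, as in Example 1.

    cudaHostUnregister(src);
    cudaFree(dst);
    free(src);
    return 0;
}
```

Note that cudaHostRegister has a nontrivial cost, so it pays off mainly when the buffer is reused across many transfers.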
Upvotes: 1
Reputation: 4422
Synchronous calls do not return control to the CPU until the operation has been completed, so your second snippet will not even begin to submit the kernel launch until after the memcpy is done.
Your cudaMemcpy() call looks incorrect; I don't think you can pass a stream parameter to any variant of memcpy whose name does not end in "Async". As written, the compiler might accept the code and take the stream as the memcpy direction.
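For reference, a minimal sketch of the stream-taking variant, cudaMemcpyAsync (the kernel and buffer names here are placeholders standing in for those in the question):

```cuda
#include <cuda_runtime.h>

// Placeholder kernel, standing in for the question's `kernel`.
__global__ void kernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    const size_t size = n * sizeof(float);
    float *src, *dst, *other;

    cudaMallocHost(&src, size);  // pinned, so the async copy can truly overlap
    cudaMalloc(&dst, size);
    cudaMalloc(&other, size);

    cudaStream_t stream1, stream2;
    cudaStreamCreate(&stream1);
    cudaStreamCreate(&stream2);

    // The copy on stream1 and the kernel on stream2 (touching an unrelated
    // buffer) may overlap; the host is not blocked by either launch.
    cudaMemcpyAsync(dst, src, size, cudaMemcpyHostToDevice, stream1);
    kernel<<<(n + 255) / 256, 256, 0, stream2>>>(other, n);

    // Block the host only where ordering is actually needed.
    cudaStreamSynchronize(stream1);
    cudaStreamSynchronize(stream2);

    cudaFree(other);
    cudaFree(dst);
    cudaFreeHost(src);
    cudaStreamDestroy(stream1);
    cudaStreamDestroy(stream2);
    return 0;
}
```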
Upvotes: 3