Reputation: 3003
I'm confused by some comments I've seen about blocking and cudaMemcpy. My understanding is that Fermi hardware can execute kernels and perform a cudaMemcpy at the same time.
I read that the library function cudaMemcpy() is a blocking function. Does this mean the function blocks further execution until the copy has fully completed, or does it mean the copy won't start until the previous kernels have finished?
For example, do these two snippets block in the same way?
SomeCudaCall<<<25,34>>>(someData);
cudaThreadSynchronize();
vs
SomeCudaCall<<<25,34>>>(someParam);
cudaMemcpy(toHere, fromHere, sizeof(int), cudaMemcpyHostToDevice);
Upvotes: 8
Views: 7487
Reputation: 445
According to the NVIDIA CUDA Programming Guide:
In order to facilitate concurrent execution between host and device, some function calls are asynchronous: Control is returned to the host thread before the device has completed the requested task. These are:
- Kernel launches;
- Memory copies between two addresses to the same device memory;
- Memory copies from host to device of a memory block of 64 KB or less;
- Memory copies performed by functions that are suffixed with Async;
- Memory set function calls.
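So as long as your host-to-device transfer is larger than 64 KB, your examples are equivalent. Note that the copy in your second snippet is only sizeof(int) bytes, so it falls under the 64 KB exception listed above.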
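To make the asynchronous launch and the blocking copy concrete, here is a minimal sketch. The kernel `writeAnswer` and the helper `launchThenBlockingCopy` are hypothetical names invented for illustration. It deliberately uses a device-to-host copy, which is not in the list of asynchronous calls above, so the plain cudaMemcpy both waits for the preceding kernel in the default stream and blocks the host until the data has arrived.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: a single thread writes 42 into one device int.
__global__ void writeAnswer(int *out) {
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        *out = 42;
    }
}

// Launches the kernel and reads the result back with a blocking copy.
// Returns the copied value, or -1 if any CUDA call fails.
int launchThenBlockingCopy() {
    int *dValue = nullptr;
    int hValue = 0;
    if (cudaMalloc(&dValue, sizeof(int)) != cudaSuccess) {
        return -1;
    }

    // The launch is asynchronous: control returns to the host right away.
    writeAnswer<<<1, 32>>>(dValue);

    // No explicit synchronize call here. The copy is issued to the same
    // (default) stream, so it runs after the kernel, and the call does not
    // return until the data is in hValue.
    cudaError_t err = cudaMemcpy(&hValue, dValue, sizeof(int),
                                 cudaMemcpyDeviceToHost);

    cudaFree(dValue);
    return err == cudaSuccess ? hValue : -1;
}

int main() {
    printf("value = %d\n", launchThenBlockingCopy());
    return 0;
}
```

If an explicit barrier is wanted instead, note that cudaThreadSynchronize() from the question is deprecated in later toolkits in favor of cudaDeviceSynchronize().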
Upvotes: 6
Reputation: 98118
Your examples are equivalent. If you want asynchronous execution, use cudaMemcpyAsync together with streams (or separate contexts), so that the copy can overlap with kernel execution.
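A rough sketch of the stream approach follows, under a few assumptions: the kernel `scale`, the helper `overlapCopyAndKernel`, the `CHECK` macro, and the buffer names are all hypothetical, and the host buffer is allocated with cudaMallocHost because cudaMemcpyAsync needs page-locked host memory for the copy to overlap with a kernel. Whether the two operations actually run concurrently depends on the device.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Bail out of the enclosing bool function on the first failing CUDA call.
// For brevity this sketch does not release resources on the error path.
#define CHECK(call) do { if ((call) != cudaSuccess) return false; } while (0)

// Hypothetical kernel: multiplies every element by a constant factor.
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] *= factor;
    }
}

// Issues a kernel and a host-to-device copy in two different streams so the
// hardware is free to overlap them. Returns true if every call succeeded.
bool overlapCopyAndKernel() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *hNext = nullptr;   // pinned host buffer holding the next input
    float *dWork = nullptr;   // device buffer the kernel operates on
    float *dNext = nullptr;   // device buffer receiving the next input
    cudaStream_t computeStream, copyStream;

    CHECK(cudaMallocHost(&hNext, bytes));   // page-locked host memory
    CHECK(cudaMalloc(&dWork, bytes));
    CHECK(cudaMalloc(&dNext, bytes));
    CHECK(cudaMemset(dWork, 0, bytes));
    CHECK(cudaStreamCreate(&computeStream));
    CHECK(cudaStreamCreate(&copyStream));

    for (int i = 0; i < n; ++i) {
        hNext[i] = static_cast<float>(i);
    }

    // Both calls return to the host immediately. Because they are in
    // different streams and touch different buffers, nothing orders them
    // relative to each other.
    scale<<<(n + 255) / 256, 256, 0, computeStream>>>(dWork, n, 2.0f);
    CHECK(cudaGetLastError());
    CHECK(cudaMemcpyAsync(dNext, hNext, bytes, cudaMemcpyHostToDevice,
                          copyStream));

    // Block the host only when the results are actually needed.
    CHECK(cudaStreamSynchronize(computeStream));
    CHECK(cudaStreamSynchronize(copyStream));

    cudaStreamDestroy(computeStream);
    cudaStreamDestroy(copyStream);
    cudaFree(dWork);
    cudaFree(dNext);
    cudaFreeHost(hNext);
    return true;
}

int main() {
    printf("overlap demo %s\n", overlapCopyAndKernel() ? "ok" : "failed");
    return 0;
}
```

The kernel and the copy use separate device buffers on purpose: if the copy wrote into the buffer the kernel was reading, the two streams would race unless an event or a synchronize call ordered them.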
Upvotes: 8