Reputation: 106
Say you have a fairly typical sequence of async compute:
1. Fill a (pinned) host buffer with the next chunk of input data.
2. cudaMemcpyAsync that buffer from host to device.
3. Launch a kernel that consumes the data on the device.
Now if you're running in a loop, the host will loop back to step 1 as 2&3 are non-blocking for the host.
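For concreteness, the kind of loop I have in mind is roughly this (not my actual code; the names are just placeholders):
while (have_more_frames()) {
    fill_input(h_ibuff);                                                       // step 1: host-side work, blocks the CPU thread
    cudaMemcpyAsync(d_ibuff, h_ibuff, bytes, cudaMemcpyHostToDevice, stream);  // step 2: returns immediately
    kernel<<<grid, block, 0, stream>>>(d_ibuff);                               // step 3: returns immediately
}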
The question is: "What happens for the host if it gets to step 2 again and the system isn't yet done transferring data to the device?"
Does the host MemCpyAsync block until the previous copy is complete?
Does it get launched like normal with the outbound data being put in a buffer?
If the latter, presumably this buffer can run out of space if your host is running too fast wrt the device operations?
I'm aware that modern devices have multiple copy engines, but I'm not sure if those would be useful for multiple copies on the same stream and to the same place.
I get that a system that ran into this wouldn't be a well designed one - asking as a point of knowledge.
This isn't something I have encountered in code yet - looking for any pointers to documentation on how this behavior is supposed to work. Already looked at the API page for the copy function and async behavior and didn't see anything I could recognize as relevant.
Upvotes: 1
Views: 753
Reputation: 152143
Does the host MemCpyAsync block until the previous copy is complete?
No, generally not, assuming by "block" you mean block the CPU thread. Asynchronous work items issued to the GPU go into a queue, and control is immediately returned to the CPU thread, before the work has begun. The queue can hold "many" items. Issuance of work can proceed until the queue is full without any other hindrances or dependencies.
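To make that concrete, here is a minimal sketch (h_buf, d_buf and bytes are placeholders; assume h_buf points to pinned host memory):
cudaStream_t stream;
cudaStreamCreate(&stream);
cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, stream); // goes into the stream's queue; returns immediately
cudaError_t st = cudaStreamQuery(stream);  // typically cudaErrorNotReady: the copy hasn't finished (it may not even have started)
// ... the CPU thread is free to do other work here ...
cudaStreamSynchronize(stream);             // this is the call that blocks, until everything queued in the stream is done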
It's important to keep in mind one of the two rules of stream semantics: operations issued into the same stream execute in issue order; a given operation in a stream will not begin until all previously issued operations in that stream have completed.
So let's say we had a case like this (and assume h_ibuff and h_obuff point to pinned host memory):
cudaStream_t stream;
cudaStreamCreate(&stream);
for (int i = 0; i < frames; i++){
    cudaMemcpyAsync(d_ibuff, h_ibuff, bytes, cudaMemcpyHostToDevice, stream); // H->D copy of the input buffer
    kernel<<<...,stream>>>(...);                                              // kernel reads d_ibuff, writes d_obuff
    cudaMemcpyAsync(h_obuff, d_obuff, bytes, cudaMemcpyDeviceToHost, stream); // D->H copy of the results
}
On the second pass of the loop, the cudaMemcpyAsync operations will be inserted into a queue, but will not begin to execute (or do anything, really) until stream semantics say they can begin. This is really true for each and every op issued by this loop.
A reasonable question might be "what if on each pass of the loop, I wanted different contents in h_ibuff?" (quite sensible). Then you would need to address that specifically. Inserting a simple memcpy operation by itself to "reload" h_ibuff isn't going to work. You'd need some sort of synchronization. For example, you might decide that you wanted to "refill" h_ibuff while the kernel and subsequent cudaMemcpyAsync D->H operation are happening. You could do something like this:
cudaStream_t stream;
cudaEvent_t event;
cudaEventCreate(&event);
cudaStreamCreate(&stream);
for (int i = 0; i < frames; i++){
    cudaMemcpyAsync(d_ibuff, h_ibuff, bytes, cudaMemcpyHostToDevice, stream); // H->D copy of the current buffer contents
    cudaEventRecord(event, stream);                                           // event completes when that copy has finished
    kernel<<<...,stream>>>(...);
    cudaMemcpyAsync(h_obuff, d_obuff, bytes, cudaMemcpyDeviceToHost, stream);
    cudaEventSynchronize(event);  // block the CPU thread until the H->D copy is done (kernel and D->H copy are already queued)
    memcpy(h_ibuff, databuff+i*chunksize, chunksize); // "refill" h_ibuff while the GPU works on the previous contents
}
This refactoring would allow the asynchronous work to be issued to keep the GPU busy, and would "overlap" the host-side copy that "refills" h_ibuff with that GPU work. It will also prevent the "refill" operation from beginning until the previous buffer contents are safely transferred to the device, and will prevent the next buffer copy from beginning until the new contents are "reloaded".
This isn't the only way to do this; it's one possible approach.
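For example (again just a sketch, not the only alternative: here h_ibuff[0] and h_ibuff[1] are two pinned input buffers, and h_ibuff[0] is assumed to be pre-filled before the loop), double-buffering the input means the CPU thread only has to wait for the copy issued on the previous pass, not the one it just issued:
cudaStream_t stream;
cudaEvent_t copied[2];
cudaStreamCreate(&stream);
cudaEventCreate(&copied[0]); cudaEventCreate(&copied[1]);
for (int i = 0; i < frames; i++){
    int cur = i & 1;
    cudaMemcpyAsync(d_ibuff, h_ibuff[cur], chunksize, cudaMemcpyHostToDevice, stream);
    cudaEventRecord(copied[cur], stream);     // completes when this buffer is safe to overwrite again
    kernel<<<...,stream>>>(...);
    cudaMemcpyAsync(h_obuff, d_obuff, chunksize, cudaMemcpyDeviceToHost, stream);
    if (i + 1 < frames){
        int nxt = cur ^ 1;
        cudaEventSynchronize(copied[nxt]);    // wait only for the copy issued on the *previous* pass (returns immediately on pass 0)
        memcpy(h_ibuff[nxt], databuff+(i+1)*chunksize, chunksize); // stage the next frame's input
    }
}
Whether this buys you anything depends on how long the host-side refill takes relative to the device work on each pass.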
For this last question asked/answered above, you might ask a similar question: "how about handling the output buffer side?" The mechanism could be very similar, and is mostly left to the reader, but one possible sketch follows.
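Here, consume_results() is just a placeholder for whatever you do with each frame's output. The idea is to record an event after the D->H copy and consume the previous frame's results while the current frame is still in flight:
cudaEvent_t out_event;
cudaEventCreate(&out_event);
for (int i = 0; i < frames; i++){
    cudaMemcpyAsync(d_ibuff, h_ibuff, bytes, cudaMemcpyHostToDevice, stream);
    kernel<<<...,stream>>>(...);
    if (i > 0){
        cudaEventSynchronize(out_event);  // previous frame's results are now in h_obuff
        consume_results(h_obuff, i-1);    // must finish before this pass's D->H copy is issued, since it reuses h_obuff
    }
    cudaMemcpyAsync(h_obuff, d_obuff, bytes, cudaMemcpyDeviceToHost, stream);
    cudaEventRecord(out_event, stream);   // completes when this frame's results have landed in h_obuff
}
cudaEventSynchronize(out_event);          // after the loop, wait for and consume the final frame
consume_results(h_obuff, frames-1);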
For structured learning on this topic, you might wish to study the CUDA concurrency section of this lecture series.
Upvotes: 4