Reputation: 71
I want to overlap data transfers and kernel executions in a form like this:
int numStreams = 3;
int size = 10;

// Queue the input copies and the kernel for each iteration on one of the streams.
for (int i = 0; i < size; i++) {
    cuMemcpyHtoDAsync(_bufferIn1,
                      _host_memoryIn1,
                      _size * sizeof(T),
                      cuda_stream[i % numStreams]);
    cuMemcpyHtoDAsync(_bufferIn2,
                      _host_memoryIn2,
                      _size * sizeof(T),
                      cuda_stream[i % numStreams]);
    cuLaunchKernel(_kernel,
                   gs.x(), gs.y(), gs.z(),
                   bs.x(), bs.y(), bs.z(),
                   _memory_size,
                   cuda_stream[i % numStreams],
                   _kernel_arguments,
                   0);
    cuEventRecord(events[i], cuda_stream[i % numStreams]);
}

// Wait for each kernel, then queue the copy of its result back to the host.
for (int i = 0; i < size; i++) {
    cuEventSynchronize(events[i]);
    cuMemcpyDtoHAsync(_host_memoryOut,
                      _bufferOut,
                      _size * sizeof(T),
                      cuda_stream[i % numStreams]);
}
Is overlapping possible in this case? Currently only the HtoD-transfers overlap with the kernel executions. The first DtoH-transfer is executed after the last kernel execution.
Upvotes: 0
Views: 360
Reputation: 304
Overlap is possible only between operations issued to different streams. Operations within the same stream execute sequentially, in the order the host issued them, so a device-to-host copy starts only after all earlier work on its stream has completed. In your code both the last kernel (i = 9) and the first copy (i = 0) go to stream 0, so that copy has to wait for the last kernel to finish. Because the copies are issued from a separate loop that also blocks the host on an event in every iteration, the copies on the other streams (streams 1 and 2) have not been issued yet at that point.
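To make the ordering argument concrete, here is a small host-only sketch in plain C++. It is a toy model, not the CUDA runtime: each stream is treated as a FIFO queue, the costs (`kHtoD`, `kKernel`, `kDtoH`) are made-up time units, and copy-engine contention and host-side blocking are ignored. `Op`, `firstDtoHStart` and `buildOrder` are hypothetical names introduced only for this illustration. It compares the issue order from the question with one possible reordering, in which each device-to-host copy is queued right after its own kernel on the same stream.

```cpp
#include <vector>

// Toy model, not the CUDA runtime: each stream is a FIFO queue, and an
// operation starts as soon as the previous operation in its stream ends.
// Copy-engine contention and host-side blocking are ignored on purpose.
struct Op {
    int stream;       // index of the stream the op is queued on
    double duration;  // hypothetical cost in arbitrary time units
    bool isDtoH;      // true for device-to-host copies
};

// Returns the start time of the first device-to-host copy in issue order,
// or -1 if the sequence contains no such copy.
double firstDtoHStart(const std::vector<Op>& issueOrder, int numStreams) {
    std::vector<double> streamEnd(numStreams, 0.0);  // time at which each stream is free
    for (const Op& op : issueOrder) {
        double start = streamEnd[op.stream];
        if (op.isDtoH) return start;
        streamEnd[op.stream] = start + op.duration;
    }
    return -1.0;
}

// Builds the issue order for `size` iterations spread over `numStreams` streams.
// copyInLoop == false: all DtoH copies are issued after the last kernel,
//                      as in the question.
// copyInLoop == true:  each DtoH copy is issued right after its own kernel.
std::vector<Op> buildOrder(int size, int numStreams, bool copyInLoop) {
    const double kHtoD = 1.0, kKernel = 2.0, kDtoH = 1.0;  // assumed costs
    std::vector<Op> ops;
    for (int i = 0; i < size; ++i) {
        int s = i % numStreams;
        ops.push_back({s, kHtoD, false});    // first input copy
        ops.push_back({s, kHtoD, false});    // second input copy
        ops.push_back({s, kKernel, false});  // kernel launch
        if (copyInLoop) ops.push_back({s, kDtoH, true});
    }
    if (!copyInLoop) {
        for (int i = 0; i < size; ++i) ops.push_back({i % numStreams, kDtoH, true});
    }
    return ops;
}
```

With `size = 10` and `numStreams = 3`, `firstDtoHStart(buildOrder(10, 3, false), 3)` gives 16 in this model, which is exactly when the last kernel on stream 0 finishes, while `firstDtoHStart(buildOrder(10, 3, true), 3)` gives 4, right after the first kernel. In terms of the question's code, the second ordering would correspond to calling `cuMemcpyDtoHAsync` inside the first loop on `cuda_stream[i % numStreams]`. Whether that yields real overlap on a given device also depends on factors the model leaves out, such as the number of copy engines and whether the host buffers are page-locked.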
Upvotes: 2