Reputation: 643
I want to optimize my CUDA program by overlapping data transfer with kernel execution, but the sample program asyncAPI.cu in the CUDA SDK is too simple to help.
I searched for this problem and found some tutorials that use two CUDA streams to achieve overlapping. In my case, a huge amount of data needs to be computed, so I need to loop and dispatch a portion of the data to the GPU on each iteration. But I don't know how to write such a loop, because all operations are asynchronous and I am afraid the data being transferred will overwrite the data currently being computed on.
Has anyone experienced this?
Any help would be appreciated.
Upvotes: 3
Views: 2298
Reputation: 2053
One thing you should keep in mind is that operations in the same stream are executed in order and will only overlap with operations in other streams. When I worked with streams, my approach was to have a separate memory location for each stream to use. This eliminates synchronization problems between streams. If that is not an option for you because of memory constraints, or because you need to share data between kernels, you have to program the synchronization yourself.
Also, if you issue any calls on the default stream, that stream will wait for all other streams to finish before executing, and no other stream can execute while something is running on the default stream.
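Here is a minimal sketch of the per-stream-buffer approach, assuming a placeholder kernel `myKernel` and made-up sizes (`nChunks`, `chunkSize`); adapt the names to your own code. Because operations within a stream run in order, reusing a stream's device buffer for a later chunk is safe: its copy cannot start before the previous chunk in that stream has finished.

```cuda
// Overlap H2D copy, kernel execution, and D2H copy across two streams.
// Each stream has its own device buffer, so chunks never overwrite each other.
#include <cuda_runtime.h>

__global__ void myKernel(float *d, int n) {       // stand-in computation
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main() {
    const int nStreams = 2, nChunks = 8, chunkSize = 1 << 20;
    float *h_data;                                // host memory must be pinned
    cudaMallocHost(&h_data, (size_t)nChunks * chunkSize * sizeof(float));

    cudaStream_t stream[nStreams];
    float *d_buf[nStreams];                       // one device buffer per stream
    for (int s = 0; s < nStreams; ++s) {
        cudaStreamCreate(&stream[s]);
        cudaMalloc(&d_buf[s], chunkSize * sizeof(float));
    }

    for (int c = 0; c < nChunks; ++c) {
        int s = c % nStreams;                     // round-robin over streams
        float *h_chunk = h_data + (size_t)c * chunkSize;
        // These three calls run in order within stream s,
        // but overlap with the work queued on the other stream.
        cudaMemcpyAsync(d_buf[s], h_chunk, chunkSize * sizeof(float),
                        cudaMemcpyHostToDevice, stream[s]);
        myKernel<<<(chunkSize + 255) / 256, 256, 0, stream[s]>>>(d_buf[s], chunkSize);
        cudaMemcpyAsync(h_chunk, d_buf[s], chunkSize * sizeof(float),
                        cudaMemcpyDeviceToHost, stream[s]);
    }
    cudaDeviceSynchronize();                      // wait for all streams to drain

    for (int s = 0; s < nStreams; ++s) {
        cudaStreamDestroy(stream[s]);
        cudaFree(d_buf[s]);
    }
    cudaFreeHost(h_data);
    return 0;
}
```

Note that the host buffer must be allocated with `cudaMallocHost` (pinned memory); `cudaMemcpyAsync` from pageable memory will not actually overlap with kernel execution.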
Hope this helps.
Upvotes: 2