GoT

Reputation: 550

Which is better? A loop inside the kernel or looping kernel launches for a CUDA GPU

Device: GeForce GTX 680

In the program, I have a very long array (approximately 1 GB of integers) to be processed inside the kernel. As per my needs, the array is divided sequentially into blocks with some overlap between adjacent blocks (the overlap is k). Each block has a fixed size m. So the array is divided into the ranges (0, m), (m-k, (m-k)+m), ...

From the above, the number of blocks needed in my program will be approximately 1 GB / m. Since the total number of blocks per launch is limited on the GPU, how can I do this effectively? Should I call the kernel iteratively from the host without any loop inside the kernel? Should I call the kernel once and loop inside it for multiple iterations? Or should I call the kernel only once with the total number of blocks equal to 1 GB / m?

What is the best value for the number of blocks in this program, and which method is best?

Upvotes: 0

Views: 2567

Answers (1)

Roger Dahl

Reputation: 15734

I would suggest the following sequence for the first version of your app:

Init:

  • allocate room on the GPU for two non-overlapping blocks of the array (slot 1 and slot 2)
  • copy the first non-overlapping block to slot 1

Loop:

  • copy the next non-overlapping block to slot 2
  • run a kernel that runs on slot 1 and partially into slot 2
  • copy contents of slot 2 to slot 1 (GPU to GPU memory copy)
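The init/loop sequence above might look roughly like this on the host. This is a hedged sketch, not a definitive implementation: `process_block`, `gridSize`, `blockSize`, `h_array`, `n` (total element count), `m` (block size), and `k` (overlap) are assumed names, and error checking is omitted for brevity.

```cuda
int *d_buf;  // slot 1 = d_buf[0..m), slot 2 = d_buf[m..2m)
cudaMalloc(&d_buf, 2 * m * sizeof(int));

// Init: copy the first non-overlapping block into slot 1.
cudaMemcpy(d_buf, h_array, m * sizeof(int), cudaMemcpyHostToDevice);

for (size_t offset = m; offset < n; offset += m) {
    size_t count = min(m, n - offset);
    // Copy the next non-overlapping block into slot 2.
    cudaMemcpy(d_buf + m, h_array + offset, count * sizeof(int),
               cudaMemcpyHostToDevice);
    // Kernel reads slot 1 plus the first k elements of slot 2.
    process_block<<<gridSize, blockSize>>>(d_buf, m + k);
    // Slide the window: slot 2 becomes the new slot 1 (device-to-device copy).
    cudaMemcpy(d_buf, d_buf + m, count * sizeof(int),
               cudaMemcpyDeviceToDevice);
}
```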

In a later version, you can avoid the GPU to GPU copy by copying alternately into slot 1 and slot 2 and wrapping the addressing around in the kernel so that instead of overflowing slot 2, it starts at the beginning of slot 1. Think of it as slot 1 and slot 2 being arranged into a ring buffer. You can also improve performance by adding more slots and asynchronously copying blocks of the array to new slots while the kernel is running on previous slots.
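The wrap-around addressing could be sketched like this inside the kernel (again an assumed sketch: `d_buf` holds two slots of m elements each, and `base` is where the current block starts, alternating between 0 and m on successive launches):

```cuda
__global__ void process_block_ring(const int *d_buf, size_t m, size_t k,
                                   size_t base)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i >= m + k) return;   // current block plus its overlap into the next slot
    // Wrap around: indices past the end of the buffer start over at slot 1,
    // so the two slots behave as a ring buffer.
    int v = d_buf[(base + i) % (2 * m)];
    // ... process v ...
}
```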

Upvotes: 1
