GoT

Reputation: 550

Which is better? A loop inside the kernel or looping kernel launches for a CUDA GPU

Device: GeForce GTX 680

In the program, I have a very long array (approximately 1 GB of integers) to be processed inside the kernel. As per my needs, the array is divided sequentially into blocks with some overlap between adjacent blocks (the overlap is k). Each block has a fixed size m. So the array is divided into the ranges (0, m), (m-k, (m-k)+m), ...

From the above, the number of blocks needed in my program will be approximately 1 GB / m. Since the total number of blocks per launch is limited on the GPU, how can I do this effectively? Should I call the kernel iteratively from the host without any loop inside the kernel? Should I call the kernel once and loop inside it for multiple iterations? Or should I call the kernel only once with the total number of blocks equal to 1 GB / m?

What is the best value for the number of blocks in this program, and which method is best?

Upvotes: 0

Views: 2567

Answers (1)

Roger Dahl

Reputation: 15734

I would suggest the following sequence for the first version of your app:

Init:

  • allocate room on the GPU for two non-overlapping blocks of the array (slot 1 and slot 2)
  • copy the first non-overlapping block to slot 1

Loop:

  • copy the next non-overlapping block to slot 2
  • run a kernel that runs on slot 1 and partially into slot 2
  • copy contents of slot 2 to slot 1 (GPU to GPU memory copy)
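The init/loop sequence above might look roughly like this on the host. This is a hedged sketch, not a definitive implementation: `process_block`, `gridSize`, `blockSize`, `h_array`, `n` (total element count), `m` (block size), and `k` (overlap) are assumed names, and error checking is omitted for brevity.

```cuda
int *d_buf;  // slot 1 = d_buf[0..m), slot 2 = d_buf[m..2m)
cudaMalloc(&d_buf, 2 * m * sizeof(int));

// Init: copy the first non-overlapping block into slot 1.
cudaMemcpy(d_buf, h_array, m * sizeof(int), cudaMemcpyHostToDevice);

for (size_t offset = m; offset < n; offset += m) {
    size_t count = min(m, n - offset);
    // Copy the next non-overlapping block into slot 2.
    cudaMemcpy(d_buf + m, h_array + offset, count * sizeof(int),
               cudaMemcpyHostToDevice);
    // Kernel reads slot 1 plus the first k elements of slot 2.
    process_block<<<gridSize, blockSize>>>(d_buf, m + k);
    // Slide the window: slot 2 becomes the new slot 1 (device-to-device copy).
    cudaMemcpy(d_buf, d_buf + m, count * sizeof(int),
               cudaMemcpyDeviceToDevice);
}
```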

In a later version, you can avoid the GPU to GPU copy by copying alternately into slot 1 and slot 2 and wrapping the addressing around in the kernel so that instead of overflowing slot 2, it starts at the beginning of slot 1. Think of it as slot 1 and slot 2 being arranged into a ring buffer. You can also improve performance by adding more slots and asynchronously copying blocks of the array to new slots while the kernel is running on previous slots.
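The wrap-around addressing could be sketched like this inside the kernel (again an assumed sketch: `d_buf` holds two slots of m elements each, and `base` is where the current block starts, alternating between 0 and m on successive launches):

```cuda
__global__ void process_block_ring(const int *d_buf, size_t m, size_t k,
                                   size_t base)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i >= m + k) return;   // current block plus its overlap into the next slot
    // Wrap around: indices past the end of the buffer start over at slot 1,
    // so the two slots behave as a ring buffer.
    int v = d_buf[(base + i) % (2 * m)];
    // ... process v ...
}
```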

Upvotes: 1
