Reputation: 550
Device GeForce GTX 680
In my program I have a very long array (approximately 1 GB of integers) to be processed inside a kernel. The array is divided sequentially into blocks of fixed size m, with an overlap of k between consecutive blocks, so the blocks cover (0, m), (m-k, (m-k)+m), and so on.
By this calculation, my program needs approximately (1 GB / m) blocks. Since the total number of blocks on the GPU is limited, how can I do this effectively? Should I call the kernel iteratively from the host, with no loop inside the kernel? Should I call the kernel once and loop inside it over multiple iterations? Or should I call the kernel only once, with the total number of blocks = (1 GB / m)?
What is the best value for the number of blocks in this program, and which of these methods should I use?
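To make the partitioning concrete, here is a minimal host-side C++ sketch of the index arithmetic, assuming the array has n elements and that m and k are counted in elements with m > k. The names segment_count, segment_bounds and segments_for_worker are hypothetical and only illustrate the layout described above. The last function mimics the "call the kernel once and loop inside it" option: a fixed number of workers each walk over several array blocks, which inside a CUDA kernel would presumably be driven by blockIdx.x and gridDim.x.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Half-open element range [begin, end) of one array block.
struct Segment {
    std::size_t begin;
    std::size_t end;
};

// Number of overlapping blocks needed to cover n elements (assumes n > 0),
// with block length m and overlap k between neighbours (requires m > k).
inline std::size_t segment_count(std::size_t n, std::size_t m, std::size_t k) {
    assert(m > k);
    if (n <= m) return 1;
    const std::size_t stride = m - k;          // distance between block starts
    return (n - m + stride - 1) / stride + 1;  // ceil((n - m) / stride) + 1
}

// Bounds of block i: it starts at i * (m - k) and is clipped at the array end.
inline Segment segment_bounds(std::size_t i, std::size_t n,
                              std::size_t m, std::size_t k) {
    assert(m > k);
    const std::size_t begin = i * (m - k);
    const std::size_t end = (begin + m < n) ? begin + m : n;
    return {begin, end};
}

// Blocks handled by one worker when num_workers workers share `total` blocks
// in a strided loop (worker w takes w, w + num_workers, w + 2*num_workers, ...).
inline std::vector<std::size_t> segments_for_worker(std::size_t worker,
                                                    std::size_t num_workers,
                                                    std::size_t total) {
    assert(num_workers > 0);
    std::vector<std::size_t> mine;
    for (std::size_t s = worker; s < total; s += num_workers) {
        mine.push_back(s);
    }
    return mine;
}
```

With this layout the block count depends on the stride m - k rather than on m alone, so it comes out somewhat larger than (array size / m) whenever k > 0.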
Upvotes: 0
Views: 2567
Reputation: 15734
I would suggest the following sequence for the first version of your app:
Init:
Loop:
In a later version, you can avoid the GPU to GPU copy by copying alternately into slot 1 and slot 2 and wrapping the addressing around in the kernel so that instead of overflowing slot 2, it starts at the beginning of slot 1. Think of it as slot 1 and slot 2 being arranged into a ring buffer. You can also improve performance by adding more slots and asynchronously copying blocks of the array to new slots while the kernel is running on previous slots.
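The wrap-around addressing can be sketched as follows. This is a host-side C++ illustration only, assuming the slots are equally sized and laid out back to back in one allocation. The helper names ring_index, slot_for_chunk and read_window are hypothetical, and a real kernel would apply the same modulo arithmetic to its device pointer instead of a std::vector.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Physical index inside a ring of num_slots slots of slot_elems elements each,
// for a position `pos` counted from the start of the full array.
inline std::size_t ring_index(std::size_t pos, std::size_t slot_elems,
                              std::size_t num_slots) {
    assert(slot_elems > 0 && num_slots > 0);
    return pos % (slot_elems * num_slots);
}

// Slot that receives chunk c when chunks are copied in round-robin order.
inline std::size_t slot_for_chunk(std::size_t c, std::size_t num_slots) {
    assert(num_slots > 0);
    return c % num_slots;
}

// Read `len` consecutive array elements starting at array position `start`.
// A read that runs past the last slot continues at the beginning of slot 0,
// which is how an overlapping block can span two chunks without a copy.
inline std::vector<int> read_window(const std::vector<int>& ring,
                                    std::size_t slot_elems,
                                    std::size_t num_slots,
                                    std::size_t start, std::size_t len) {
    assert(ring.size() == slot_elems * num_slots);
    std::vector<int> out;
    out.reserve(len);
    for (std::size_t i = 0; i < len; ++i) {
        out.push_back(ring[ring_index(start + i, slot_elems, num_slots)]);
    }
    return out;
}
```

For example, with two slots of four elements, once the third chunk (array positions 8 to 11) has overwritten slot 0, a window starting at array position 6 reads two elements from the end of slot 1 and two from the start of slot 0. With more slots, the same slot_for_chunk rule picks the target of each asynchronous copy (for instance one issued with cudaMemcpyAsync on a separate stream), provided no running kernel still reads the slot being overwritten.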
Upvotes: 1