Reputation: 49
I noticed that using dynamic indices reduces the speed of CUDA code by a factor of 12. See the following example:
__global__ void static3Ops(int start, int end, const float* p, const int* prog_dont_use, float* c)
{
    int i = threadIdx.x;
    float buf[6];
    buf[0] = 1.0e7f;
    buf[1] = c[i];
    const int prog[] = { 0, 1, 2, 3, 4, 5 };   // indices known at compile time
    for (long j = start; j < end; j++) {
        buf[2] = p[j];
        buf[3] = buf[prog[0]] + buf[prog[1]];
        buf[4] = buf[prog[2]] - buf[prog[3]];
        buf[1] = buf[prog[4]] * buf[prog[5]];
    }
    c[i] = buf[1];
}
is 12x faster than
__global__ void static3Ops(int start, int end, const float* p, const int* prog, float* c)
{
    int i = threadIdx.x;
    float buf[6];
    buf[0] = 1.0e7f;
    buf[1] = c[i];
    for (long j = start; j < end; j++) {
        buf[2] = p[j];
        buf[3] = buf[prog[0]] + buf[prog[1]];   // indices only known at run time
        buf[4] = buf[prog[2]] - buf[prog[3]];
        buf[1] = buf[prog[4]] * buf[prog[5]];
    }
    c[i] = buf[1];
}
Any hint on how to minimize that overhead? The dynamic nature is a core feature of my code, so I can hardly do without it...
Note that the overhead on the CPU is only about 20%.
Upvotes: 2
Views: 702
Reputation: 1974
Ideas for speed improvement:
If possible, calculate the variations of prog in CUDA code using threadIdx and blockIdx. Calculations are faster than memory accesses.
Be careful with shared memory (you asked to ignore its indexing for now, but here it is anyway). You have to make sure that each thread of a block uses a different index, and that the index used by each thread within a warp falls in a different bank; otherwise you get a performance penalty.
So if you have blocks of size 128 threads and i contains the thread number:
__shared__ float buf[128 * 6];
buf[0] -> buf[0*128 + i];
buf[1] -> buf[1*128 + i];
buf[prog[0]] -> buf[prog[0]*128 + i];
...
As the block size (128) is divisible by 32, each thread within a warp accesses a different shared memory bank, even if the prog index differs. Thread 0 always accesses bank 0, and so on.
Try to keep the bufs directly in registers instead of shared memory: buf0, buf1, buf2, ...
How do you access them by index? Just write an inline function or macro with a switch-case.
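A minimal sketch of such an accessor (rd is a made-up name, not from the question). A read-only accessor is enough here, since in the question's kernel only the reads use dynamic indices - the writes always go to fixed slots:

__device__ __forceinline__ float rd(int k, float b0, float b1, float b2,
                                    float b3, float b4, float b5)
{
    switch (k) {
    case 0: return b0;
    case 1: return b1;
    case 2: return b2;
    case 3: return b3;
    case 4: return b4;
    default: return b5;
    }
}

// inside the loop, e.g.:
// buf3 = rd(prog[0], buf0, buf1, buf2, buf3, buf4, buf5)
//      + rd(prog[1], buf0, buf1, buf2, buf3, buf4, buf5);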
There are 6*6*6*6*6*6 possibilities. You could try optimizing by generating the code for 36 or 216 possibilities and then just call the appropriate one. E.g.
switch (prog01) {   // prog01 encodes prog[0] and prog[1], e.g. prog01 = 6 * prog[0] + prog[1]
    case 0: buf3 = buf0 + buf0; break;
    case 1: buf3 = buf0 + buf1; break;
    ...
    case 6: buf3 = buf1 + buf0; break;
    ...
}
But possibly it is faster to do 6 switches with 6 cases each; then you have fewer cases/comparisons/jumps.
Best would be: do half of the switches (e.g. 216 variants) outside the loop, and the other half inside each of the resulting 216 loops.
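A rough sketch of that idea, assuming prog01 = 6 * prog[0] + prog[1] has been computed before the switch; the outer switch picks one specialized copy of the loop, so only the remaining indices stay dynamic inside it:

switch (prog01) {                      // resolved once, outside the hot loop
case 0:                                // prog[0] == 0, prog[1] == 0
    for (long j = start; j < end; j++) {
        buf2 = p[j];
        buf3 = buf0 + buf0;            // fixed by the outer switch
        // buf4 and buf1 are still chosen dynamically, e.g. via the rd() accessor above
    }
    break;
case 1:                                // prog[0] == 0, prog[1] == 1
    for (long j = start; j < end; j++) {
        buf2 = p[j];
        buf3 = buf0 + buf1;
        // ...
    }
    break;
// ... remaining cases ...
}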
Possibly it would be even better to create device function pointers outside the loop and call the corresponding function, which chooses the bufs. But then those functions would have to choose from the bufs as function parameters instead of local variables. Hopefully they are still efficiently stored in registers.
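A hypothetical sketch of the function-pointer idea (the function and kernel names are invented for illustration): each op receives the bufs as parameters and picks the ones it needs, and the pointer is resolved once per thread before the loop. Note that an indirect call typically cannot be inlined, so this may only pay off if it saves enough comparisons per iteration.

typedef float (*PickOp)(float, float, float, float, float, float);

__device__ float add_b0_b1(float b0, float b1, float b2, float b3, float b4, float b5) { return b0 + b1; }
__device__ float add_b0_b2(float b0, float b1, float b2, float b3, float b4, float b5) { return b0 + b2; }
// ... one such __device__ function per combination that can actually occur ...

__global__ void static3OpsFnPtr(int start, int end, const float* p, const int* prog, float* c)
{
    int i = threadIdx.x;
    float b0 = 1.0e7f, b1 = c[i], b2 = 0.f, b3 = 0.f, b4 = 0.f, b5 = 0.f;

    // resolve the function pointer outside the loop (only the '+' op is shown here)
    PickOp addOp = (prog[1] == 1) ? add_b0_b1 : add_b0_b2;

    for (long j = start; j < end; j++) {
        b2 = p[j];
        b3 = addOp(b0, b1, b2, b3, b4, b5);   // indirect call; the bufs stay in registers
        // ... analogous pointers would handle the '-' and '*' steps ...
    }
    c[i] = b3;   // the full kernel would write b1 after the '*' step
}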
Local accesses are generally quite fast. If you have local switches and jumps, you lose computation time. So make sure that the threads of each warp are aligned with similar prog parameters where possible.
Please compare and test with complete warps (not just one thread) to get more realistic results, including bank collisions in shared memory.
Upvotes: 0
Reputation: 49
Thank you all for your hints!
So far, the fastest code I have found is as follows:
__global__ void static3OpsShared(int start, int end, const float* prices, const int* __restrict__ prog, float* c)
{
    int i = threadIdx.x;
    __shared__ float buf[6];
    buf[0] = 1.0e7f;
    buf[1] = c[i];
    // I never use more than 6 values of prog in a single thread - but each thread has its own set.
    // Values of prog range from 0...5.
    // Performance needs to focus on what happens within the following loop, which typically has over 10000 iterations.
    for (long j = start; j < end; j++) {
        buf[2] = prices[j];
        buf[3] = buf[prog[0]] + buf[prog[1]];
        buf[4] = buf[prog[2]] - buf[prog[3]];
        buf[1] = buf[prog[4]] * buf[prog[5]];
    }
    c[i] = buf[1];
}
(Please ignore the shared memory indexing for a moment - I ran this with a single thread so far)
Using registers for prog[0]...prog[5], i.e. writing r0 = prog[0]; and then using buf[r0] instead of buf[prog[0]], seems to be done by the optimizer already.
The biggest improvement came from using shared memory for buf[]. __restrict__ did not help, however; in particular, __restrict__ does not apply to buf, since its values are meant to be re-used.
My conclusion is: if registers could be used instead of buf[], the code would be around 5x faster.
Upvotes: 0
Reputation: 131405
First order of business: use __restrict__ on all of your pointers! It's super-important! Read about it here:
CUDA Pro Tip: Optimize to avoid pointer aliasing
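For the kernel from the question this simply means qualifying every pointer parameter (which promises the compiler that the pointers do not alias each other):

__global__ void static3Ops(int start, int end,
                           const float* __restrict__ p,
                           const int*   __restrict__ prog,
                           float*       __restrict__ c);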
Now, beyond that...
If:

- the size of prog is bounded by a small value, and
- accesses to prog are to indices known at compile-time (i.e. not the value, but the index),

Then:

- Keep the prog values in an std::array-like class, e.g. kat::array from the cuda-kat library's development branch (due disclosure: it's a library I'm working on, so I'm biased here; also, the array implementation is quite stable). Load their values from the prog pointer you get as a parameter.
- Make sure reads from the prog array can be coalesced. So, for example, the first prog element for the first thread, then the first element for the second thread, etc., up to the 31st thread's first prog element.
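A sketch of such a coalesced layout - assuming prog is stored "element-major", i.e. element e of thread t lives at prog[e * nthreads + t], so that the 32 threads of a warp read 32 consecutive ints at each step (a plain local array is used here for brevity; a kat::array variant would be analogous):

__global__ void loadProgCoalesced(const int* __restrict__ prog, int nthreads)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    int my_prog[6];                        // stays in registers when indexed with constants
    #pragma unroll
    for (int e = 0; e < 6; ++e)
        my_prog[e] = prog[e * nthreads + t];
    // ... use my_prog[0] .. my_prog[5] (constant indices) in the main loop ...
}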
If:

- the size of prog is not bounded by a small value, but
- you only ever access a few known positions (e.g. prog[k*i + 1], prog[k*i + 3], prog[k*i + 4] only),

Then:

- only load those elements of prog.
If:

- prog is not-so-small but not-so-large (tens of elements to thousands of elements per thread),

Then:

- read prog into shared memory first;
- the part of prog used by each thread in shared memory should be entirely contained by a single bank.

If none of the above holds, then:
Always remember that whatever you do - profile and analyze it, don't just settle for the bottom-line number. And try to break down changes and profile them separately. For example, first add the __restrict__ and see what that gives you. CUDA Nsight Compute should also tell you where your bottlenecks are (though not what to do about them...).
Upvotes: 2
Reputation: 1283
The two possibilities I can think of:
If prog is a small array: Use your own solution! I.e. use prog just like it is defined in the top example, if prog really is an array with a small number of elements (like in your example). But your comment that "the dynamic nature is a core feature of my code" makes it sound like this is not an option for you. When I change const int prog[] = { 0,1,2,3,4,5 } to int prog_0 = 0, prog_1 = 1, ... and use prog_0, prog_1, ... instead of prog[], I get the same performance. That indicates that prog[]'s values are stored directly in registers without involving global memory. If prog is not a small array or is not known at compile time, this method may lead to heavy use of local memory and degrade performance significantly.
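A minimal sketch of that scalar variant (the kernel from the question, with prog[] replaced by six plain ints that the compiler keeps in registers):

__global__ void static3OpsScalars(int start, int end, const float* p, float* c)
{
    int i = threadIdx.x;
    float buf[6];
    buf[0] = 1.0e7f;
    buf[1] = c[i];
    // the "program" as individual scalars instead of an array
    int prog_0 = 0, prog_1 = 1, prog_2 = 2, prog_3 = 3, prog_4 = 4, prog_5 = 5;
    for (long j = start; j < end; j++) {
        buf[2] = p[j];
        buf[3] = buf[prog_0] + buf[prog_1];
        buf[4] = buf[prog_2] - buf[prog_3];
        buf[1] = buf[prog_4] * buf[prog_5];
    }
    c[i] = buf[1];
}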
If prog is a large array: Have threads load prog into shared memory in parallel and then access the shared memory accordingly in the rest of your kernel (block-level tiling):
__shared__ int prog_sh[6];   // or allocate dynamically if the size is not known at compile time
int i = threadIdx.x;
if (i < 6)
    prog_sh[i] = prog[i];
__syncthreads();
// ... and then use prog_sh instead of prog ...
Note that this really does not make sense for a small array with known values like in your example, but you would be surprised how much gain you can achieve with tiling when working with large arrays. Nonetheless, you should make sure you are able to achieve high memory bandwidth for concurrent access (see this link) when dealing with shared memory.
Upvotes: 2