eLe

Reputation: 49

How do I minimize overhead in a CUDA kernel with dynamic array indexing?

I noticed that using dynamic indices reduces the speed of CUDA code by a factor of 12 - see the following example:

__global__ void static3Ops(int start, int end, const float* p, const int* prog_dont_use, float* c)
{
    int i = threadIdx.x;
    float buf[6]; // prog values go up to 5, so buf needs 6 slots
    buf[0] = 1.0e7;
    buf[1] = c[i];
    const int prog[] = { 0,1,2,3,4,5 }; // indices are compile-time constants here

    for (long j = start; j < end; j++) {
        buf[2] = p[j];
        buf[3] = buf[prog[0]] + buf[prog[1]];
        buf[4] = buf[prog[2]] - buf[prog[3]];
        buf[1] = buf[prog[4]] * buf[prog[5]];
    }
    c[i] = buf[1];
}

is 12x faster than

__global__ void static3Ops(int start, int end, const float* p, const int* prog, float* c)
{
    int i = threadIdx.x;
    float buf[6]; // prog values go up to 5, so buf needs 6 slots
    buf[0] = 1.0e7;
    buf[1] = c[i];

    for (long j = start; j < end; j++) {
        buf[2] = p[j];
        buf[3] = buf[prog[0]] + buf[prog[1]];
        buf[4] = buf[prog[2]] - buf[prog[3]];
        buf[1] = buf[prog[4]] * buf[prog[5]];
    }
    c[i] = buf[1];
}

Any hints on how to minimize that overhead? The dynamic nature is a core feature of my code... so I can hardly do without it...

Note that the overhead on the CPU is only about 20%.

Upvotes: 2

Views: 702

Answers (4)

Sebastian

Reputation: 1974

Ideas for speed improvement:

If possible, calculate the variations of prog in CUDA code using threadIdx and blockIdx. Calculations are faster than memory accesses.

Be careful with shared memory (you asked for it to be ignored, but here it is anyway): you have to make sure that each thread of a block uses a different index, and that the indices of the threads within a warp go to different banks, otherwise you pay a performance penalty.

So if you have blocks of size 128 threads and i contains the thread number:

__shared__ float buf[128 * 6];
buf[0] -> buf[0*128 + i];
buf[1] -> buf[1*128 + i];
buf[prog[0]] -> buf[prog[0]*128 + i];
...

As the block size (128) is divisible by 32, each thread within a warp accesses a different shared memory bank, even if the prog indices differ: thread 0 always accesses bank 0, and so on.
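Applied to the question's kernel, a minimal sketch of this layout could look as follows (the kernel name is made up, and blockDim.x == 128 is assumed):

__global__ void static3OpsStrided(int start, int end, const float* p, const int* prog, float* c)
{
    const int i = threadIdx.x;       // assumes blockDim.x == 128
    __shared__ float buf[6 * 128];   // 6 slots per thread, strided by the block size

    buf[0 * 128 + i] = 1.0e7f;
    buf[1 * 128 + i] = c[i];

    for (long j = start; j < end; j++) {
        // Thread i always hits bank i % 32, whatever prog contains.
        buf[2 * 128 + i] = p[j];
        buf[3 * 128 + i] = buf[prog[0] * 128 + i] + buf[prog[1] * 128 + i];
        buf[4 * 128 + i] = buf[prog[2] * 128 + i] - buf[prog[3] * 128 + i];
        buf[1 * 128 + i] = buf[prog[4] * 128 + i] * buf[prog[5] * 128 + i];
    }
    c[i] = buf[1 * 128 + i];
}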

Alternative

Try to keep the bufs directly in registers instead of shared memory: buf0, buf1, buf2, ...

How do you access them by index? Just write an inline function or macro with a switch statement, as sketched below.
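For example, a minimal sketch of such a selection helper (the name sel is made up):

__device__ __forceinline__ float sel(int k, float b0, float b1, float b2,
                                     float b3, float b4, float b5)
{
    // Select one of six register-resident buffers by index; only the reads
    // are dynamic, so the writes can keep their fixed target variables.
    switch (k) {
    case 0:  return b0;
    case 1:  return b1;
    case 2:  return b2;
    case 3:  return b3;
    case 4:  return b4;
    default: return b5;
    }
}

// Usage inside the loop:
// buf3 = sel(prog[0], buf0, buf1, buf2, buf3, buf4, buf5)
//      + sel(prog[1], buf0, buf1, buf2, buf3, buf4, buf5);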

There are 6*6*6*6*6*6 possibilities. You could try optimizing by generating the code for 36 or 216 combinations and then just calling the appropriate one. E.g.

switch(prog01) { // prog01 = prog[0] * 6 + prog[1]
case 0: buf3 = buf0 + buf0; break;
case 1: buf3 = buf0 + buf1; break;
...
case 6: buf3 = buf1 + buf0; break;
...
}

But possibly it is faster to do 6 switches with 6 cases each; then you have fewer cases/comparisons/jumps.

Best would be a middle ground: do half of the switch decisions (e.g. 216 combinations) outside the loop, and the remaining switches inside each of the 216 loop variants.

Possibly it would be even better to set up device function pointers outside the loop and call the corresponding function, which chooses the bufs. But then those functions would have to choose from bufs passed as function parameters instead of local variables; hopefully they would still be kept efficiently in registers.

Local accesses are generally quite fast, but local switches and jumps cost computation time. So try to arrange things so that the threads of each warp work on similar prog parameters.

Please compare and test with complete warps (not just one thread) to get more realistic results, including bank collisions on shared memory.

Upvotes: 0

eLe

Reputation: 49

Thank you all for your hints!

So far, the fastest code I have found is as follows:

__global__ void static3OpsShared(int start, int end, const float* prices, const int* __restrict__ prog, float* c)
{
    int i = threadIdx.x;
    __shared__ float buf[6]; // prog values go up to 5, so buf needs 6 slots
    buf[0] = 1.0e7;
    buf[1] = c[i];
    // I never use more than 6 values of prog in a single thread - but each thread has its own set
    // values of prog are ranging from 0...5 
    // Performance needs to focus on what happens within the following loop typically having over 10000 iterations
    for (long j = start; j < end; j++) { 
        buf[2] = prices[j];
        buf[3] = buf[prog[0]] + buf[prog[1]];
        buf[4] = buf[prog[2]] - buf[prog[3]];
        buf[1] = buf[prog[4]] * buf[prog[5]];
    }
    c[i] = buf[1];
}

(Please ignore the shared memory indexing for a moment - I have only run this with a single thread so far.)

Using registers for prog[0]...prog[5], in the form of

r0 = prog[0];

and then using buf[r0] instead of buf[prog[0]], seems to be done by the optimizer already.
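For reference, the explicit form would be something like the following sketch (the compiler appears to perform this transformation on its own):

// Hoist the prog loads out of the loop; each rN should end up in a register.
const int r0 = prog[0], r1 = prog[1], r2 = prog[2],
          r3 = prog[3], r4 = prog[4], r5 = prog[5];

for (long j = start; j < end; j++) {
    buf[2] = prices[j];
    buf[3] = buf[r0] + buf[r1];
    buf[4] = buf[r2] - buf[r3];
    buf[1] = buf[r4] * buf[r5];
}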

The biggest improvement came from using shared memory for buf[]. __restrict__ somehow did not help; in particular, it does not apply to buf, since its values are supposed to be re-used.

My conclusion is: if registers could be used instead of buf[], the code would be around 5x faster.

Upvotes: 0

einpoklum

Reputation: 131405

Avoid pointer aliasing

First order of business: use __restrict__ on all of your pointers! It's super-important! Read about it here:

CUDA Pro Tip: Optimize to avoid pointer aliasing

Now, beyond that...

Play with your access patterns and try to improve memory locality

If:

  • The size of prog is bounded by a small value, and
  • Accesses to prog are at indices known at compile time (i.e. it is the index, not the value, that must be known)

Then:

  • Use a kernel-local, plain C-style array, or an std::array-like class, e.g. kat::array from the cuda-kat library's development branch (due disclosure: it's a library I'm working on, so I'm biased here; also, the array implementation is quite stable). Load its values from the prog pointer you get as a parameter.
  • Arrange your data in memory carefully, so that loading into the prog array can be coalesced: for example, the first prog element for the first thread, then the first element for the second thread, etc., up to the 32nd thread's first prog element (see the sketch after this list).
  • Do all the loads into prog before using any of the values.
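A minimal sketch of such a transposed, coalesced load (the element-major layout, the name progs_transposed and the parameter num_threads are illustrative assumptions, not part of the question's code):

__global__ void static3OpsLocalProg(int start, int end,
                                    const float* __restrict__ p,
                                    const int* __restrict__ progs_transposed,
                                    int num_threads,
                                    float* __restrict__ c)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Element-major layout: at step k, consecutive threads read consecutive
    // addresses, so each of the six loads is coalesced across the warp.
    int prog[6];
    #pragma unroll
    for (int k = 0; k < 6; k++)
        prog[k] = progs_transposed[k * num_threads + i];

    // ... then use prog[0]..prog[5] exactly as in the question's kernel;
    // since prog is only indexed with compile-time constants, it can stay
    // in registers.
}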

If:

  • The size of prog is not bounded by a small value, but
  • You can arrange the use of prog so that, for every small, bounded-length stretch of it, the accesses to it are at fixed offsets from some baseline (for example: loop over i; at iteration i we access prog[k*i + 1], prog[k*i + 3] and prog[k*i + 4] only),

Then:

  • Do the same as in the previous case, but for every fixed-length stretch of prog.

If:

  • prog is not-so-small but not-so-large (tens of elements to thousands of elements per thread), and
  • the access pattern into it is random, arbitrary or data-dependent

Then:

  • Load prog into shared memory first.
  • Make sure to load it so that you don't get bank conflicts, i.e. each block thread's equivalent of prog in shared memory should be entirely contained within a single bank (see the sketch below).
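A minimal sketch of such a bank-friendly shared-memory layout, assuming a block size that is a multiple of 32 and the same element-major global layout as above (all names here are made up):

#define BLOCK_SIZE 128  // assumed; must be a multiple of 32

__device__ void load_prog_to_shared(const int* __restrict__ prog_global,
                                    int prog_len, int num_threads,
                                    int* prog_sh)
{
    const int i  = threadIdx.x;
    const int gi = blockIdx.x * BLOCK_SIZE + i;  // global thread index

    // Element k of thread i lives at k * BLOCK_SIZE + i, so thread i always
    // touches bank i % 32 and the threads of a warp never conflict; the
    // global reads are coalesced, since consecutive threads read
    // consecutive addresses for each k.
    for (int k = 0; k < prog_len; k++)
        prog_sh[k * BLOCK_SIZE + i] = prog_global[k * num_threads + gi];
    __syncthreads();
}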

If none of the above holds, then:

  • Keep the memory layout and access pattern which allow for coalesced access.
  • Try to concentrate loads of data which you expect to be relatively close by (or maybe it's not so important; this is more of a weak tip).

A methodical note

Always remember: whatever you do, profile and analyze it; don't just settle for the bottom-line number. And try to break changes down and profile them separately. For example, first add the __restrict__ and see what that gives you. NVIDIA Nsight Compute should also tell you where your bottlenecks are (though not what to do about them...).

Upvotes: 2

If_You_Say_So

Reputation: 1283

The two possibilities I can think of:

If prog is a small array: use your own solution! I.e. use prog just like how it is defined in the top example, if prog really is an array with a small number of elements (like in your example). But your comment that "the dynamic nature is a core feature of my code" makes it sound like this is not an option for you. When I change const int prog[] = { 0,1,2,3,4,5 } to int prog_0 = 0, prog_1 = 1, ... and use prog_0, prog_1, ... instead of prog[], I get the same performance. That indicates that prog[]'s values are stored directly in registers without involving global memory. If prog is not a small array, or is not known at compile time, this method may lead to heavy use of local memory and degrade performance significantly.

If prog is a large array: have the threads load prog into shared memory in parallel, and then access the shared memory accordingly in the rest of your kernel (block-level tiling).

__shared__  int prog_sh[6]; // or dynamically allocate if size is not known
int i = threadIdx.x;
if (i < 6)
    prog_sh[i] = prog[i];
__syncthreads();

// and then use prog_sh instead of prog....

Note that this really does not make sense for a small array with known values like your example, but you would be surprised how much you can gain with tiling when working with large arrays. Nonetheless, you should make sure you are able to achieve high memory bandwidth for concurrent access (see this link) when dealing with shared memory.

Upvotes: 2
