Manolete

Reputation: 3517

CUDA indices for loops with counters

I've got a nested loop with a counter in between. I've managed to use CUDA indices for the outer loop, but I can't see any way to exploit more parallelism in this kind of loop. Do you have any experience working with something similar?

int i = threadIdx.x + blockIdx.x * blockDim.x;
if (i < Nx) {
    int counter = 0;
    for (int k = 0; k < Ny; k++) {

        d_V[i*Ny + k] = 0;

        if ( d_X[i*Ny + k] >= 2e2 ) {

             /* do stuff with i and k and counter, e.g. */
                d_example[i*length + counter] = k;
                    ...
             /* increment counter */
             counter++;
        }
    }
}

The problem I see is how to deal with counter, since k could also be indexed in CUDA with threadIdx.y + blockIdx.y * blockDim.y

Upvotes: 0

Views: 1072

Answers (3)

Tomasz Dzięcielewski

Reputation: 3907

If possible, you can use CUDPP or Thrust, libraries that implement parallel functions such as remove_if or compact, which is essentially what your example does.

CUDPP

Thrust

On those pages you can find simple examples of how to use them. I prefer CUDPP because, in my opinion, it is faster than Thrust.
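For instance, stream compaction with Thrust might look like the following minimal sketch. The helper compact_row and the predicate are names made up for illustration (only thrust::copy_if is library API); it collects, for one row of d_X, the indices k whose elements pass the threshold, which is exactly what the counter loop in the question computes.

#include <thrust/copy.h>
#include <thrust/device_ptr.h>
#include <thrust/iterator/counting_iterator.h>

/* Predicate: keep index k when the corresponding element passes the
   threshold used in the question. */
struct above_threshold
{
    const float *x;
    above_threshold(const float *x_) : x(x_) {}
    __host__ __device__ bool operator()(int k) const { return x[k] >= 2e2f; }
};

/* Hypothetical helper: compact the indices of one row of d_X into
   d_example_row and return how many were kept (the final "counter"). */
int compact_row(const float *d_X_row, int *d_example_row, int Ny)
{
    thrust::counting_iterator<int> first(0), last(Ny);
    thrust::device_ptr<int> out(d_example_row);
    thrust::device_ptr<int> out_end =
        thrust::copy_if(first, last, out, above_threshold(d_X_row));
    return (int)(out_end - out);
}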

Upvotes: 1

Marius Brendmoe

Reputation: 365

Note that you can use threadIdx.y as the second index in your array. For more info, see: http://www.cs.sunysb.edu/~mueller/teaching/cse591_GPU/threads.pdf

For instance, if you launch blocks in two dimensions, you can use threadIdx.x and threadIdx.y as your indices and add the block offsets (blockIdx.x * blockDim.x and blockIdx.y * blockDim.y) to get global coordinates.

As branching is very expensive on GPUs, and all threads in a given workgroup wait for every thread in the group before continuing, it is better, if possible, to simply compute all elements and discard the ones you do not need; that could avoid the counter entirely. If not, the best solution is to use the atomic increment features of the CUDA API on a global counter, as phoad suggested in his comment. A sketch of that approach follows.
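A minimal sketch of the atomic variant, assuming a per-row counter array d_counters (a name introduced here for illustration) that is zeroed before launch. Note that with atomics the order of the stored k values within a row is no longer deterministic.

__global__ void pack_rows(const float *d_X, int *d_example,
                          int *d_counters, int Nx, int Ny, int length)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int k = threadIdx.y + blockIdx.y * blockDim.y;

    if (i < Nx && k < Ny && d_X[i*Ny + k] >= 2e2f) {
        /* atomicAdd returns the previous value, so every passing (i, k)
           claims a unique slot in row i */
        int slot = atomicAdd(&d_counters[i], 1);
        d_example[i*length + slot] = k;
    }
}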

Upvotes: 1

Phil H

Reputation: 20151

Having a counter/loop variable that is used between loop iterations is the natural antithesis of parallelisation. Ideal parallel loops have iterations that can run in any order, with no knowledge of each other. Unfortunately, a shared variable makes them both order-dependent and mutually aware.

It looks like you're using the counter to pack your d_example array without gaps. This can often be cheaper in compute time at the cost of some memory: if you pack d_example inefficiently, letting the elements that are never set stay at zero, you can filter d_example later, after the expensive computational steps.

In fact, you could even leave the filtering to a modified iterator when the array is read, which simply skips over any zero values. If zero is a valid value in the array, use a particular NaN value or a separate mask array instead.

int i = threadIdx.x + blockIdx.x * blockDim.x;
if (i < Nx) {
    for (int k = 0; k < Ny; k++) {

        d_V[i*Ny + k] = 0;

        if ( d_X[i*Ny + k] >= 2e2 ) {

             /* do stuff with i and k, writing to slot k directly,
                so no counter is needed (assumes length >= Ny) */
                d_example[i*length + k] = k;
                d_examask[i*length + k] = 1;
                    ...
        } else {
             d_examask[i*length + k] = 0;
        }
    }
}
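The later filtering step could then be a stream compaction over the mask, for example with Thrust. In the sketch below, filter_row is a hypothetical helper name; only thrust::copy_if (the stencil overload) is library API. It keeps the entries of one row of d_example whose mask is set and returns how many survived.

#include <thrust/copy.h>
#include <thrust/device_ptr.h>

/* Predicate over the mask: keep entries whose mask value is nonzero. */
struct mask_set
{
    __host__ __device__ bool operator()(int m) const { return m != 0; }
};

/* Hypothetical helper: compact one row of d_example into d_out_row using
   d_examask as a stencil; returns the number of entries kept. */
int filter_row(const int *d_example_row, const int *d_examask_row,
               int *d_out_row, int length)
{
    thrust::device_ptr<const int> in(d_example_row);
    thrust::device_ptr<const int> mask(d_examask_row);
    thrust::device_ptr<int> out(d_out_row);
    thrust::device_ptr<int> out_end =
        thrust::copy_if(in, in + length, mask, out, mask_set());
    return (int)(out_end - out);
}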

Upvotes: 1
