Reputation: 3517
I've got a nested loop with a counter in between. I've managed to use CUDA indices for the outer loop, but I can't think of any way to exploit more parallelism in this kind of loop. Do you have any experience working with something similar to this?
int i = threadIdx.x + blockIdx.x * blockDim.x;
if (i < Nx) {
    counter = 0;
    for (k = 0; k < Ny; k++) {
        d_V[i*Ny + k] = 0;
        if ( d_X[i*Ny + k] >= 2e2 ) {
            /* do stuff with i and k and counter, e.g. */
            d_example[i*length + counter] = k;
            ...
            /* increment counter */
            counter++;
        }
    }
}
The problem that I see is how to deal with counter, as k could also be indexed in CUDA with threadIdx.y + blockIdx.y * blockDim.y.
Upvotes: 0
Views: 1072
Reputation: 3907
If it's possible, you can use cudpp or thrust, libraries which implement parallel functions like remove_if or compact (essentially what you have in your example).
You can find simple examples on their pages showing how to use them. I prefer cudpp because, IMHO, it is faster than thrust.
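For example, a stream compaction with thrust might look roughly like the sketch below. This is only a sketch, not the asker's code: it compacts the whole flattened d_X in one pass rather than row by row, the functor name is_large and the host-side setup are made up, and only the 2e2 threshold is taken from the question.

#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <thrust/sequence.h>
#include <iostream>

/* predicate matching the question's threshold test */
struct is_large
{
    __host__ __device__
    bool operator()(float x) const { return x >= 2e2f; }
};

int main()
{
    const int Nx = 4, Ny = 8;

    /* sample input; in the real code d_X is already on the device */
    thrust::device_vector<float> d_X(Nx * Ny, 0.0f);
    d_X[3]  = 500.0f;
    d_X[10] = 300.0f;

    /* candidate flat indices 0 .. Nx*Ny-1 */
    thrust::device_vector<int> indices(Nx * Ny);
    thrust::sequence(indices.begin(), indices.end());

    /* compact: keep an index only where the corresponding d_X element passes the test */
    thrust::device_vector<int> d_example(Nx * Ny);
    thrust::device_vector<int>::iterator end =
        thrust::copy_if(indices.begin(), indices.end(),
                        d_X.begin(),          /* stencil */
                        d_example.begin(),
                        is_large());

    std::cout << "kept " << (end - d_example.begin()) << " indices" << std::endl;
    return 0;
}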
Upvotes: 1
Reputation: 365
Note that you can use threadIdx.y as the second index in your array. For more info see here: http://www.cs.sunysb.edu/~mueller/teaching/cse591_GPU/threads.pdf
For instance, if you have blocks in two dimensions, you can use threadIdx.x and threadIdx.y as your indices and add the block offset (blockIdx.x * blockDim.x) to get the global index.
Branching is very expensive on GPUs (threads in a warp execute in lockstep, so divergent paths are serialised), so if possible it is better to simply compute all elements and discard the ones you do not need; this could avoid the use of the counter entirely. If not, the best solution is to use the atomic increment features of the CUDA API on a global counter, as phoad suggested in his comment.
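As an illustration of the atomic-counter idea, the sketch below is one possible kernel (an assumption, not code from this answer): it indexes both dimensions, and each thread that passes the test claims a unique slot in its row with atomicAdd on a per-row counter array d_count. The kernel name, the d_count array and the launch configuration are made up, and the order of the stored k values within a row is no longer deterministic.

__global__ void flag_kernel(const float *d_X, float *d_V, int *d_example,
                            int *d_count, int Nx, int Ny, int length)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;   /* row    */
    int k = threadIdx.y + blockIdx.y * blockDim.y;   /* column */

    if (i < Nx && k < Ny) {
        d_V[i*Ny + k] = 0;
        if (d_X[i*Ny + k] >= 2e2f) {
            /* atomicAdd returns the previous value, so every thread that
               passes the test gets its own slot in row i of d_example */
            int slot = atomicAdd(&d_count[i], 1);
            d_example[i*length + slot] = k;
        }
    }
}

/* possible launch, with d_count zeroed beforehand (e.g. via cudaMemset):
   dim3 block(16, 16);
   dim3 grid((Nx + block.x - 1) / block.x, (Ny + block.y - 1) / block.y);
   flag_kernel<<<grid, block>>>(d_X, d_V, d_example, d_count, Nx, Ny, length);
*/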
Upvotes: 1
Reputation: 20151
Having a counter/loop variable which is used between loop iterations is a natural antithesis to parallelisation. Ideal parallel loops have iterations which could run in any order, with no knowledge of each other; unfortunately, a shared variable makes them both order-dependent and mutually aware.
It looks like you're using the counter to pack your d_example array without gaps. This kind of thing could well be more efficient in compute time at the cost of wasting some memory: if you let the elements of d_example which won't be set stay as zero, i.e. pack d_example inefficiently, you can perform a filter on d_example later, after any expensive computational steps.
In fact you could even leave the filtration to a modified iterator when the array is read, which just skips over any zero values. If zero is a valid value in the array, use a particular NaN value or a separate mask array instead.
int i = threadIdx.x + blockIdx.x * blockDim.x;
if (i < Nx) {
    for (k = 0; k < Ny; k++) {
        d_V[i*Ny + k] = 0;
        if ( d_X[i*Ny + k] >= 2e2 ) {
            /* do stuff with i and k, storing at the natural position k
               (this assumes length >= Ny), so no counter is needed */
            d_example[i*length + k] = k;
            d_examask[i*length + k] = 1;
            ...
        } else {
            d_examask[i*length + k] = 0;
        }
    }
}
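Once the arrays come back to the host, the deferred filter can be a simple pass over the mask. The sketch below is a minimal host-side illustration (the names compact_row, h_example, h_examask and out are assumptions, not part of the original code); it packs the surviving k values of one row and returns the count that the original counter variable used to track.

/* pack the kept k values of row i, skipping entries the mask marked unset */
int compact_row(const int *h_example, const int *h_examask,
                int *out, int i, int Ny, int length)
{
    int packed = 0;
    for (int k = 0; k < Ny; k++) {
        if (h_examask[i*length + k]) {
            out[packed++] = h_example[i*length + k];
        }
    }
    return packed;   /* plays the role the original counter played */
}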
Upvotes: 1