Reputation: 387
Basically, I have an if() in my kernel, and when the condition is met I would like to store a new value in a dynamic list or array. The problem is that I can't use threadIdx as the index, because not every thread will store a value.
Something like:
__global__ void myKernel(customType *c)
{
    int i = threadIdx.x;
    //whatever
    if (condition)
        c->pop(newvalue);
}
In fact, I would like to avoid c[i] = newvalue, because then I would have to loop over every c[i] in the host code to check whether a value was inserted, and fill another structure accordingly. I thought about Thrust, but it seems like overkill for my "simple" problem.
Hope you can help me find a workaround.
Upvotes: 2
Views: 1403
Reputation: 72345
If I have understood your question correctly, you have two choices.
The first would be to pre-assign each thread an output location, and only have some threads write into their output. This leaves you with an output with gaps in it. You can eliminate the gaps using stream compaction, which is a solved problem in CUDA - a quick google search will turn up a number of options, and both Thrust and CUDPP have compaction functions you could use.
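For illustration, a minimal host-side sketch of that first option using thrust::copy_if; the sentinel value, the vector names and the predicate are assumptions for the example, not part of the original answer:

#include <thrust/device_vector.h>
#include <thrust/copy.h>

// Hypothetical predicate: a slot counts as "written" if it does not hold the
// sentinel value (-1 here) that the kernel leaves in untouched slots.
struct is_written
{
    __host__ __device__
    bool operator()(int x) const { return x != -1; }
};

// gapped: one slot per thread, written by the kernel (with gaps).
// Returns the number of values actually emitted; packed holds them densely.
int compact(const thrust::device_vector<int> &gapped,
            thrust::device_vector<int> &packed)
{
    packed.resize(gapped.size());
    // copy_if drops the sentinel slots, eliminating the gaps on the device.
    auto end = thrust::copy_if(gapped.begin(), gapped.end(),
                               packed.begin(), is_written());
    int count = static_cast<int>(end - packed.begin());
    packed.resize(count);
    return count;
}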
The second choice would be to use a global memory counter and have each thread atomically increment the counter as it uses a location in the output stream, so something like:
__device__ unsigned int opos; // global output counter; set to zero before the kernel launch

__global__ void myKernel(customType *c)
{
    //whatever
    if (condition) {
        // atomically claim the next free slot in the output array
        unsigned int pos = atomicAdd(&opos, 1);
        c[pos] = newval;
    }
}
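For completeness, a minimal host-side sketch of how the counter could be reset and read back; the launch configuration and d_c are placeholders, not part of the original answer:

// Reset the __device__ counter, launch, then read back how many values were written.
unsigned int zero = 0;
cudaMemcpyToSymbol(opos, &zero, sizeof(unsigned int));

myKernel<<<numBlocks, threadsPerBlock>>>(d_c);   // d_c: device output array

unsigned int count = 0;
cudaMemcpyFromSymbol(&count, opos, sizeof(unsigned int));
// the first `count` entries of d_c now hold the emitted values (in arbitrary order)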
If you have a Kepler card, and the number of threads expected to emit output is small, the second option will probably be faster. If that isn't the case, stream compaction is probably the better option.
Upvotes: 5
Reputation: 21128
If I understand correctly, you're describing stream compaction. Some, but not all, threads will create a value, and you want to store those values in an array without any gaps.
One way to implement this is using stream compaction algorithms available in Thrust (check out this example). Note that this does require you to perform the operation in two passes.
If you're doing this from within a single thread-block (as opposed to the entire grid) then you could also look at CUB. Each thread would compute a flag indicating if it wants to store a value, do a prefix-sum on the flags to determine each thread's offset in the list, then do the store.
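As an illustration of that single-block approach, here is a minimal sketch using cub::BlockScan; the input/output names, the condition, and the block size are assumptions for the example:

#include <cub/cub.cuh>

// One thread block compacts its values using a prefix sum over per-thread flags.
template <int BLOCK_THREADS>
__global__ void blockCompact(const int *in, int *out, int *outCount, int n)
{
    typedef cub::BlockScan<int, BLOCK_THREADS> BlockScan;
    __shared__ typename BlockScan::TempStorage temp_storage;

    int i = threadIdx.x;

    // Flag whether this thread wants to store a value (hypothetical condition).
    int value = (i < n) ? in[i] : 0;
    int flag  = (i < n && value > 0) ? 1 : 0;

    // Exclusive prefix sum of the flags gives each thread its output offset,
    // and the block-wide aggregate gives the total number of stored values.
    int offset, total;
    BlockScan(temp_storage).ExclusiveSum(flag, offset, total);

    if (flag)
        out[offset] = value;

    if (i == 0)
        *outCount = total;
}

It would be launched with a single block, e.g. blockCompact<256><<<1, 256>>>(d_in, d_out, d_count, n);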
Upvotes: 4