Dynamitos
Dynamitos

Reputation: 511

Count values from array CUDA

I have an array of float values, namely life, of which i want to count the number of entries with a value greater than 0 in CUDA.

On the CPU, the code would look like this:

int numParticles = 0;
for(int i = 0; i < MAX_PARTICLES; i++){
    if(life[i]>0){
        numParticles++;
    }
}

Now in CUDA, I've tried something like this:

__global__ void update(float* life, int* numParticles){
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (life[idx]>0){
        (*numParticles)++;
    }
}
//life is a filled device pointer
int launchCount(float* life)
{
    int numParticles = 0;
    int* numParticles_d = 0;
    cudaMalloc((void**)&numParticles_d, sizeof(int));
    update<<<MAX_PARTICLES/THREADS_PER_BLOCK,THREADS_PER_BLOCK>>>(life, numParticles_d);
    cudaMemcpy(&numParticles, numParticles_d, sizeof(int), cudaMemcpyDeviceToHost);
    std::cout << "numParticles: " << numParticles << std::endl;
}

But for some reason the CUDA attempt always returns 0 for numParticles. How come?

Upvotes: 1

Views: 1578

Answers (2)

kangshiyin
kangshiyin

Reputation: 9779

Your code is actually launching MAX_PARTICLES threads, and multiple thread blocks are executing (*numParticles)++; concurrently. It is a race condition. So you have the result 0, or if you are luck, sometimes a little bigger than 0.

As your attempt to sum up life[i]>0 ? 1 : 0 for all i, you could follow CUDA parallel reduction to implement your kernel, or use Thrust reduction to simplify your life.

Upvotes: 1

talonmies
talonmies

Reputation: 72342

This:

if (life[idx]>0){
    (*numParticles)++;
}

is a read-after write hazard. Multiple threads will be simultaneously attempting to read and write from numParticles. The CUDA execution model does not guarantee anything about the order of simultaneous transactions.

You could make this work by using atomic memory transactions, for example:

if (life[idx]>0){
    atomicAdd(numParticles, 1);
}

This will serialize the memory transactions and make the calculation correct. It will also have a big negative effect on performance.

You might want to investigate having each block calculate a local sum using a reduction type calculation and then sum the block local sums atomically or on the host, or in a second kernel.

Upvotes: 3

Related Questions