Aggregate results from CUDA threads

Question

I am solving the minimal dominant set problem on CUDA. Every thread finds a local candidate result, and I need to find the best one among them. I am using __device__ variables for the global result (dev_bestConfig and dev_bestValue).

I need to do something like this:

__device__ configType dev_bestConfig = 0;
__device__ int dev_bestValue = INT_MAX;

__device__ void findMinimalDominantSet(int count, const int *matrix, Lock &lock)
{
    // here is some algorithm that finds local bestValue and bestConfig

    // set device variables
    if (bestValue < dev_bestValue)
    {
        dev_bestValue = bestValue;
        dev_bestConfig = bestConfig;
    }
}

I know that this does not work, because multiple threads access the memory at the same time, so I use this critical section:

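    // set device variables
    bool isSet = false;
    do
    {
        // atomicCAS returns the old value, so 0 means this thread took the lock
        isSet = (atomicCAS(lock.mutex, 0, 1) == 0);
        if (isSet)
        {
            // critical section goes here
            if (bestValue < dev_bestValue)
            {
                dev_bestValue = bestValue;
                dev_bestConfig = bestConfig;
            }
        }
        if (isSet)
        {
            __threadfence();            // make the writes visible before releasing
            atomicExch(lock.mutex, 0);  // release the lock
        }
    } while (!isSet);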

This works as expected, but it is really slow. For example, without the critical section it takes 0.1 seconds, and with the critical section it takes 1.8 seconds.

What can I do differently to make it faster?

Martin Volek · Accepted Answer

In the end I avoided critical sections and locking altogether. Each thread saves its local result to an array, and the best result is then found by a separate search over that array. The search can be done sequentially or by parallel reduction.
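
A minimal sketch of this idea, not the exact code I used: each block reduces its threads' candidates in shared memory, writes one result per block to an array, and the host then scans that short array sequentially. The names Candidate, findLocalBest, bestPerBlock and findBest are hypothetical, configType is assumed to be a 64-bit integer, the per-thread search is replaced by made-up demo values, and error checking is omitted for brevity.

```cuda
#include <climits>
#include <vector>
#include <cuda_runtime.h>

// Assumption: configType is a 64-bit bit mask (the question does not show its definition).
typedef unsigned long long configType;

// One candidate result: the value to minimize and the configuration that produced it.
struct Candidate {
    int value;
    configType config;
};

constexpr int BLOCK_SIZE = 256;  // must be a power of two for the tree reduction below

// Hypothetical placeholder for the real per-thread search; it only produces demo values.
__device__ Candidate findLocalBest(int tid)
{
    Candidate c;
    c.value = (tid * 37 + 11) % 1000 + 5;  // arbitrary values, all >= 5
    if (tid == 777) c.value = 1;           // plant a known minimum for the demo
    c.config = (configType)tid;
    return c;
}

// Each block reduces its threads' candidates and writes one result to blockBest.
__global__ void bestPerBlock(int totalThreads, Candidate *blockBest)
{
    __shared__ Candidate shared[BLOCK_SIZE];
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // Threads past the end contribute a neutral element that can never win.
    Candidate local = { INT_MAX, 0 };
    if (tid < totalThreads) local = findLocalBest(tid);
    shared[threadIdx.x] = local;
    __syncthreads();

    // Tree reduction: halve the number of active threads in every step.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride &&
            shared[threadIdx.x + stride].value < shared[threadIdx.x].value)
            shared[threadIdx.x] = shared[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0) blockBest[blockIdx.x] = shared[0];
}

// Host side: launch the kernel, copy the per-block results back, scan them sequentially.
Candidate findBest(int totalThreads)
{
    int blocks = (totalThreads + BLOCK_SIZE - 1) / BLOCK_SIZE;
    Candidate *devBlockBest = nullptr;
    cudaMalloc(&devBlockBest, blocks * sizeof(Candidate));

    bestPerBlock<<<blocks, BLOCK_SIZE>>>(totalThreads, devBlockBest);

    std::vector<Candidate> hostBlockBest(blocks);
    cudaMemcpy(hostBlockBest.data(), devBlockBest,
               blocks * sizeof(Candidate), cudaMemcpyDeviceToHost);
    cudaFree(devBlockBest);

    Candidate best = { INT_MAX, 0 };
    for (const Candidate &c : hostBlockBest)
        if (c.value < best.value) best = c;
    return best;
}
```

Calling findBest(1000) from the host returns the candidate planted at thread 777. No thread ever waits on another thread's lock; the only synchronization is __syncthreads() inside a block, which is why this tends to be much cheaper than a global critical section.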
