Regis Portalez

Reputation: 4860

CUDA group by and atomic counters

I have an unordered array of labelled elements:

[0,1,0,2,0,1,2] // labels only

Which I want to sort:

[0,0,0,1,1,2,2]

I have already counted how many elements carry each label and reduced the counts to an array of offsets:

[0,3,5]

meaning that all 0-labelled elements must be stored starting at position 0, all 1-labelled elements starting at position 3, and so on.
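For reference, here is a minimal sketch of one way such offsets can be produced: a per-label histogram kernel followed by an exclusive scan on the host. CountLabels, ExclusiveScan and numLabels are illustrative names, not the code I actually use:

template<typename T>
__global__ void CountLabels(int* counts, T* input, int count) {
    int index = threadIdx.x + blockDim.x * blockIdx.x;
    if (index >= count) return;
    // Count how many elements carry each label (simple histogram).
    atomicAdd(&counts[input[index].label], 1);
}

// Exclusive scan on the host: counts [3,2,2] -> offsets [0,3,5].
void ExclusiveScan(const int* counts, int* offsets, int numLabels) {
    int sum = 0;
    for (int i = 0; i < numLabels; ++i) {
        offsets[i] = sum;
        sum += counts[i];
    }
}

With the offsets in hand, I tried the following scatter kernel: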

template<typename T>
__global__ void GroupBy(T* output, T* input, int count, int* offsets) {
    int index = threadIdx.x + blockDim.x * blockIdx.x;
    if (index >= count) return;
    T elem = input[index];
    output[offsets[elem.label]] = elem;  // problem here: this read of the offset
    atomicAdd(&offsets[elem.label], 1);  // and this increment are not one atomic step
}

However, the read of offsets[elem.label] and the write into output are not atomic with the atomicAdd that follows, so I have a race condition between threads that share a label. I would not have that problem with a single counter, since

__device__ int counter = 0; // a single global counter

// inside a kernel:
int old = atomicAdd(&counter, 1); // returns the value before the increment
output[old] = elem;

would indeed yield a unique index per thread.

How can I solve that issue and have a dynamic array of atomic counters?

Upvotes: 1

Views: 403

Answers (1)

Regis Portalez

Reputation: 4860

Thanks to talonmies's kind reminder about the return value of atomicAdd, I have been able to fix my kernel:

template<typename T>
__global__ void GroupBy(T* output, T* input, int count, int* offsets) {
    int index = threadIdx.x + blockDim.x * blockIdx.x;
    if (index >= count) return;
    T elem = input[index];
    // Atomically claim a slot in this label's group and get the previous offset.
    int oldOffset = atomicAdd(&offsets[elem.label], 1);
    output[oldOffset] = elem;
}

Indeed, atomicAdd atomically increments the value stored at its first argument and returns the old value:

[atomicAdd(address, val)] reads the 16-bit, 32-bit or 64-bit word old located at the address address in global or shared memory, computes (old + val), and stores the result back to memory at the same address. These three operations are performed in one atomic transaction. The function returns old.

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#atomicadd
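For completeness, a minimal sketch of how the fixed kernel might be launched; the Item struct, RunGroupBy and the device pointers are illustrative assumptions, and error checking is omitted:

#include <cuda_runtime.h>

// Illustrative element type; GroupBy only requires a .label member.
struct Item {
    int label;
    float value;
};

void RunGroupBy(Item* d_output, Item* d_input, int count, int* d_offsets) {
    int threads = 256;
    int blocks = (count + threads - 1) / threads;
    GroupBy<Item><<<blocks, threads>>>(d_output, d_input, count, d_offsets);
    cudaDeviceSynchronize();
}

Note that the kernel mutates offsets: after it completes, offsets[i] points one past the last element of group i, so keep a copy if the starting offsets are needed again. The order of elements within a group is also non-deterministic, since it depends on which thread wins each atomic increment.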

Upvotes: 2
