Reputation: 4860
I have an unordered array of labelled elements:
[0,1,0,2,0,1,2] // labels only
Which I want to sort:
[0,0,0,1,1,2,2]
I have already counted how many elements carry each label and reduced those counts to an array of offsets:
[0,3,5]
meaning I know that I need to store all 0-labelled elements starting at position 0, all 1-labelled elements starting at position 3, and so on.
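For reference, a minimal host-side sketch of how such an offsets array can be built from the labels (a histogram followed by an exclusive prefix sum; names are illustrative, not my actual code):

#include <vector>

// offsets[l] = position where the elements labelled l start in the output
std::vector<int> MakeOffsets(const std::vector<int>& labels, int numLabels) {
    std::vector<int> counts(numLabels, 0);
    for (int label : labels)
        ++counts[label];                              // histogram: [3,2,2] for the example
    std::vector<int> offsets(numLabels, 0);
    for (int l = 1; l < numLabels; ++l)
        offsets[l] = offsets[l - 1] + counts[l - 1];  // exclusive scan: [0,3,5]
    return offsets;
}

My kernel attempt so far: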
template<typename T>
__global__ void GroupBy(T* output, T* input, int count, int* offsets) {
    int index = threadIdx.x + blockDim.x * blockIdx.x;
    if (index >= count) return;           // skip out-of-range threads
    T elem = input[index];
    output[offsets[elem.label]] = elem;   // problem here: racy read of the offset
    atomicAdd(&offsets[elem.label], 1);
}
However, the read of the offset and the write that precede the atomicAdd are not atomic with it, so I have a memory race condition between threads that share a label. I would not have that problem for a single counter, since
__device__ int count = 0; // one counter for the whole grid

int oldCount = atomicAdd(&count, 1); // the returned old value is unique per thread
output[oldCount] = elem;
would indeed yield a unique index per thread.
How can I solve that issue and have a dynamic array of atomic counters?
Upvotes: 1
Views: 403
Reputation: 4860
Thanks to talonmies' kind reminder about the return value of atomicAdd, I've been able to fix my kernel to this:
template<typename T>
__global__ void GroupBy(T* output, T* input, int count, int* offsets) {
    int index = threadIdx.x + blockDim.x * blockIdx.x;
    if (index >= count) return;  // skip out-of-range threads
    T elem = input[index];
    int oldOffset = atomicAdd(&offsets[elem.label], 1); // atomically reserve one slot
    output[oldOffset] = elem;    // the reserved slot is exclusively ours, no race
}
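For completeness, a sketch of how I launch it (buffer names such as d_offsets are illustrative; note that the kernel increments offsets in place, so it needs a fresh copy on every run):

int blockSize = 256;
int gridSize = (count + blockSize - 1) / blockSize;
// the kernel consumes the offsets, so upload a fresh copy each time
cudaMemcpy(d_offsets, h_offsets, numLabels * sizeof(int), cudaMemcpyHostToDevice);
GroupBy<<<gridSize, blockSize>>>(d_output, d_input, count, d_offsets);
cudaDeviceSynchronize();
// afterwards, d_offsets[l] has advanced to the end of bucket l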
Indeed, atomicAdd atomically increments the value stored at the address given as its first argument and returns the old value:
[atomicAdd(address, val)] reads the 16-bit, 32-bit or 64-bit word old located at the address address in global or shared memory, computes (old + val), and stores the result back to memory at the same address. These three operations are performed in one atomic transaction. The function returns old.
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#atomicadd
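As a sanity check, the "returns old" behaviour can be made visible with a self-contained toy kernel (not part of my original code): every thread pulls a distinct old value from a single counter, so each slot 0..n-1 is written exactly once.

__device__ int counter = 0;

__global__ void ClaimSlots(int* slots, int n) {
    int index = threadIdx.x + blockDim.x * blockIdx.x;
    if (index >= n) return;
    int slot = atomicAdd(&counter, 1); // old value, distinct for every thread
    slots[slot] = index;               // no two threads write the same slot
}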
Upvotes: 2