user2178841

Reputation: 859

Parallelizing a code with dependency

I have code that looks like the following:

Array a records which group each of its elements belongs to, i.e. element i belongs to group a[i]. Each group can contain at most two elements. I want to store this information in b, whose length is twice the number of groups, so that b[j] and b[j+1] (for even j) give the elements belonging to group j/2 (integer division).

void assign(int *a, int *b){
    for(int i = 0; i < N; i++){
        int group = a[i];
        int posit = 2*group;   // first slot of this group
        if(b[posit] != 0){     // first slot already taken
            posit++;           // fall back to the second slot
        }
        b[posit] = i;
    }
}

N is, of course, the length of a.

By default value in b[] is zero and indicates absence of elements.

There is a clear data dependency, and it does not look easy to parallelize. I am looking for suggestions in case there is a smart approach that I am missing.

Upvotes: 2

Views: 75

Answers (1)

CygnusX1

Reputation: 21778

In general, you could simply use a parallel sort algorithm on an array containing the pairs {i, a[i]}. Maybe there is a quicker generic method that I don't see at the moment...

In CUDA specifically, however, you can take advantage of the fact that when two conflicting threads write a 32-bit word to the same memory location, one of them is guaranteed to succeed (though you don't know which). Formally, it acts as an Arbitrary CRCW machine.

Thus, you can solve your problem in 2 kernel calls:

__global__ void assign1(int* a, int* b, int elementCount) {
    int idx = threadIdx.x + blockIdx.x*blockDim.x;
    if (idx < elementCount) {
        int group = a[idx];
        b[2*group] = idx;          // conflicting writes: one thread per group wins
    }
}

__global__ void assign2(int* a, int* b, int elementCount) {
    int idx = threadIdx.x + blockIdx.x*blockDim.x;
    if (idx < elementCount) {
        int group = a[idx];
        if (b[2*group] != idx)     // this thread lost the first write
            b[2*group+1] = idx;    // so it takes the second slot
    }
}

__host__ void assign(int* dev_a, int* dev_b, int elementCount) {
    int blockSize = 512;
    int gridSize = (elementCount + blockSize - 1) / blockSize;  // round up
    assign1<<<gridSize,blockSize>>>(dev_a, dev_b, elementCount);
    assign2<<<gridSize,blockSize>>>(dev_a, dev_b, elementCount);
}

In assign1, up to two threads write to the same memory location b[2*group]. One of them is guaranteed to succeed. In assign2, the thread whose write failed repeats the process with b[2*group+1].

This approach could be extended to groups of up to 3 or 4 elements by adding more passes, but as the group size grows it quickly stops being feasible.

Upvotes: 1
