user2178841

Reputation: 859

Parallelizing a code with dependency

I have code that looks like the following:

Array a records which group each of its elements belongs to, i.e. element i belongs to group a[i]. Each group can contain at most two elements. I want to store this information in b, whose length is twice the number of groups, so that b[j] and b[j+1] (for even j) give the elements belonging to group j/2 (integer division).

void assign(int *a, int *b){
    for(int i = 0; i < N; i++){
        int group = a[i];
        int posit = 2*group;   // first slot of this group
        if(b[posit] != 0){     // first slot already taken
            posit++;           // fall back to the second slot
        }
        b[posit] = i;
    }
}

N is, of course, the length of a.

By default value in b[] is zero and indicates absence of elements.

There is a clear data dependency, and it does not look easy to parallelize. I am looking for suggestions in case there is a smart approach that I am missing.

Upvotes: 2

Views: 75

Answers (1)

CygnusX1

Reputation: 21778

In general, you could simply use a parallel sort algorithm on an array containing the pairs {i, a[i]}. Maybe there is a quicker generic method that I don't see at the moment...

In CUDA specifically, however, you can take advantage of the fact that when two conflicting threads write a 32-bit word to the same memory location, one of them is guaranteed to succeed (though you don't know which). Formally, it acts as an Arbitrary CRCW machine.

Thus, you can solve your problem in 2 kernel calls:

__global__ void assign1(int* a, int* b, int elementCount) {
    int idx = threadIdx.x + blockIdx.x*blockDim.x;
    if (idx < elementCount) {
        int group = a[idx];
        b[2*group] = idx;          // conflicting writes: one thread per group wins
    }
}

__global__ void assign2(int* a, int* b, int elementCount) {
    int idx = threadIdx.x + blockIdx.x*blockDim.x;
    if (idx < elementCount) {
        int group = a[idx];
        if (b[2*group] != idx)     // this thread lost the first write
            b[2*group+1] = idx;    // so it takes the second slot
    }
}

__host__ void assign(int* dev_a, int* dev_b, int elementCount) {
    int blockSize = 512;
    int gridSize = (elementCount + blockSize - 1) / blockSize;  // round up
    assign1<<<gridSize,blockSize>>>(dev_a, dev_b, elementCount);
    assign2<<<gridSize,blockSize>>>(dev_a, dev_b, elementCount);
}

In assign1, up to two threads write to the same memory location b[2*group]. One of them is guaranteed to succeed. In assign2, the thread whose write failed repeats the process with b[2*group+1].

This approach could be extended to groups of up to 3 or 4 elements by adding more passes, but as the group size grows it quickly stops being feasible.

Upvotes: 1
