Reputation: 859
I have a code looks like:
Array a
contains information about group it's elements belong to. i.e. element i
belongs to group a[i]
. Each group can contains two elements. I want to store this information in b whose length is 2*number
of groups. So, value at b[j]
and b[j+1]
would give me the elements that belong to group j/2
(integer division) and j
are even.
void assign(int *a, int *b){
for(int i = 0; i<N; i++){
int group = a[i];
int posit=2*group;
if(b[2*i]!=0){
posit++;
}
b[posit] = i;
}
}
N as is clear length of a.
By default value in b[] is zero and indicates absence of elements.
There is clear data dependency, and it does not look that easy to parallelize this. I am looking for further suggestions, if there is an smart way that I am missing.
Upvotes: 2
Views: 75
Reputation: 21778
In general, you could try to simply use a parallel sort algorithm, on an array containing pairs {i, a[i]}
. Maybe there is a quicker generic method that I don't see at the moment...
In CUDA specifically, however, you can take advantage from the fact that when you have 2 conflicting threads writing a 32-bit word to the same memory location -- one is guaranteed to succeed (you don't know which one though). Formally, it acts as an Arbitrary CRCW machine.
Thus, you can solve your problem in 2 kernel calls:
__global__ void assign1(int* a, int* b, int elementCount) {
int idx = threadIdx.x + blockIdx.x*blockDim.x;
if (idx<elementCount) {
int group = a[idx];
b[2*group] = idx;
}
}
__global__ void assign2(int* a, int* b, int elementCount) {
int idx = threadIdx.x + blockIdx.x*blockDim.x;
if (idx<elementCount) {
int group = a[idx];
if (b[2*group] != idx)
b[2*group+1] = idx;
}
}
__host__ void assign(int* dev_a, int* dev_b, int elementCount) {
int gridSize = elementCount/512+1;
int blockSize = 512;
assign1<<<gridSize,blockSize>>>(dev_a, dev_b, elementCount);
assign2<<<gridSize,blockSize>>>(dev_a, dev_b, elementCount);
}
In assign1
up to 2 threads write to the same memory location b[2*group]
. One of these threads is guaranteed to succeed. In assign2
the thread that failed the write, repeats the process with b[2*group+1]
.
This approach could be repeated if there were up to 3 or 4 elements in the group, but as the number gets higher, it can quickly stop being feasible.
Upvotes: 1