Reputation: 149
I need to move each of the first k elements of a 1-D array by an offset, where the offsets are monotonically increasing, i.e., if element i has offset offset1, then element i+1 has an offset offset2 that satisfies offset2 >= offset1.
I wrote a kernel that is executed on each of the first k elements:
if (thread_id < k) {
    // compute offset
    if (offset) {
        int temp = a[thread_id];
        __syncthreads();
        a[thread_id + offset] = temp;
    }
}
When I test this with k = 3, the offsets are indeed monotonically increasing, namely 0, 1, 1. Element 0 stays in its position as expected. However, element 1 is copied not only to element 2 (as the offset for element 1 dictates), but also to element 3.
That is, it appears that thread 2 reads element 2 and stores it into its copy of temp only after thread 1 has completed the copy of element 1 to element 2.
What am I doing wrong, and how can I fix it?
Thank you!
Upvotes: 1
Views: 1279
Reputation: 27809
What you are doing generalizes to a scatter operation:
thread     0  1  2  3  4
in  = {    1, 4, 3, 2, 5 }
idx = {    1, 2, 3, 4, 0 }
out[idx[i]] = in[i]
In general a scatter cannot be done in-place in parallel, because threads read from locations that other threads write. In our example, if thread 2 reads its input location after thread 1 writes its output location, we get incorrect results. This is a race condition, and requires either synchronization or out-of-place storage.
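For large arrays the threads span multiple blocks, so the synchronization needed here is global (grid-wide) synchronization. __syncthreads() only synchronizes the threads within a single block, and an ordinary kernel launch does not provide a barrier across blocks, so you should use an out-of-place scatter.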
In other words, you cannot do this:
temp = in[thread_idx]
global-sync
in[thread_idx + offset] = temp
You must do this:
out[thread_idx + offset] = in[thread_idx]
where out does not point to the same memory as in.
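To make the out-of-place version concrete, here is a minimal sketch rather than a definitive implementation. The names scatter_out_of_place and shift_first_k are hypothetical, and the sketch assumes the offsets have already been computed into an array (the question computes them inside the kernel). It also assumes every target index stays inside the array, and it omits error checking for brevity. The output buffer starts as a copy of the input so that slots nobody writes keep their original values. Because the offsets never decrease, thread_idx + offset is strictly increasing, so no two threads write the same slot.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: each of the first k threads reads only from `in` and
// writes only to `out`, so no thread reads a location another thread writes.
__global__ void scatter_out_of_place(const int* in, const int* offsets,
                                     int* out, int k)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < k) {
        out[tid + offsets[tid]] = in[tid];
    }
}

// Hypothetical host wrapper. offsets.size() plays the role of k.
// Assumption: i + offsets[i] < a.size() for every i < k.
std::vector<int> shift_first_k(const std::vector<int>& a,
                               const std::vector<int>& offsets)
{
    const int n = static_cast<int>(a.size());
    const int k = static_cast<int>(offsets.size());
    std::vector<int> result(a);
    if (n == 0 || k == 0) return result;  // nothing to move

    int *d_in = nullptr, *d_out = nullptr, *d_off = nullptr;
    cudaMalloc(&d_in,  n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMalloc(&d_off, k * sizeof(int));

    cudaMemcpy(d_in, a.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    // Start `out` as a copy of `in` so unwritten slots keep their old values.
    cudaMemcpy(d_out, d_in, n * sizeof(int), cudaMemcpyDeviceToDevice);
    cudaMemcpy(d_off, offsets.data(), k * sizeof(int), cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks = (k + threads - 1) / threads;
    scatter_out_of_place<<<blocks, threads>>>(d_in, d_off, d_out, k);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(result.data(), d_out, n * sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_out);
    cudaFree(d_off);
    return result;
}
```

With the offsets from the question, calling shift_first_k({10, 20, 30, 40, 50}, {0, 1, 1}) should give {10, 20, 20, 30, 50}: element 1 lands in slot 2 and element 2 lands in slot 3, regardless of the order in which the threads run.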
Upvotes: 3