user1760748

Reputation: 149

CUDA: how to move array elements

I need to move each of the first k elements of a 1-D array by an offset, where the offsets are monotonically increasing, i.e., if element i has offset offset1 and element i+1 has offset offset2, then offset2 >= offset1.

I wrote a kernel that is executed on each of the first k elements:

if (thread_id < k) {

  // compute offset

  if (offset) {
    int temp = a[thread_id];

    __syncthreads();

    a[thread_id + offset] = temp;
  }
}

However, when I test this with k = 3, where the offsets are indeed monotonically increasing (namely 0, 1, 1), element 0 stays in its position as expected, but element 1 gets copied not only to element 2 (according to the offset for element 1) but also to element 3.

That is, it appears that thread 2 reads element 2 and stores it into its copy of temp only after thread 1 has completed the copy of element 1 to element 2.

What am I doing wrong, and how do I fix it?

Thank you!

Upvotes: 1

Views: 1279

Answers (1)

harrism

Reputation: 27809

What you are doing generalizes to a scatter operation:

thread   0  1  2  3  4
in  =  { 1, 4, 3, 2, 5}
idx =  { 1, 2, 3, 4, 0}

out[idx[i]] = in[i]

In general a scatter cannot be done in-place in parallel, because threads read from locations that other threads write. In our example, if thread 2 reads its input location after thread 1 writes its output location, we get incorrect results. This is a race condition, and requires either synchronization or out-of-place storage.

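For large arrays, the synchronization needed here is global, i.e., across all thread blocks, whereas __syncthreads() only synchronizes the threads within a single block. Since global synchronization is not supported in the CUDA programming model, you must use an out-of-place scatter.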

In other words, you cannot do this:

temp = in[thread_idx]
global-sync
in[thread_idx + offset] = temp

You must do this:

out[thread_idx + offset] = in[thread_idx]

Where out does not point to the same memory as in.
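
To make this concrete, here is a minimal sketch of the out-of-place version. It assumes the offsets have already been computed into an array (the question computes them inside the kernel), and that i + offset[i] stays within the bounds of the array for every i < k. The names scatter_out_of_place and move_elements are hypothetical, chosen for illustration, and error checking of the CUDA calls is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Out-of-place scatter: thread i copies in[i] to out[i + offset[i]].
// in and out are distinct buffers, so no thread reads a location that
// another thread writes, and no synchronization is needed.
__global__ void scatter_out_of_place(const int* in, const int* offset,
                                     int* out, int k)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < k) {
        out[i + offset[i]] = in[i];
    }
}

// Hypothetical host wrapper (not from the original post).
// Assumes offset holds at least k entries and i + offset[i] < a.size().
std::vector<int> move_elements(const std::vector<int>& a,
                               const std::vector<int>& offset, int k)
{
    std::vector<int> result(a);
    if (k <= 0 || a.empty()) return result;

    size_t bytes = a.size() * sizeof(int);
    size_t off_bytes = static_cast<size_t>(k) * sizeof(int);
    int *d_in = nullptr, *d_off = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_off, off_bytes);
    cudaMalloc(&d_out, bytes);

    cudaMemcpy(d_in, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_off, offset.data(), off_bytes, cudaMemcpyHostToDevice);
    // Start out as a copy of in, so positions that no thread writes
    // keep their original values.
    cudaMemcpy(d_out, d_in, bytes, cudaMemcpyDeviceToDevice);

    int threads = 256;
    int blocks = (k + threads - 1) / threads;
    scatter_out_of_place<<<blocks, threads>>>(d_in, d_off, d_out, k);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(result.data(), d_out, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_off);
    cudaFree(d_out);
    return result;
}
```

With the offsets from the question (0, 1, 1) and an input of {10, 20, 30, 40}, calling move_elements(a, offset, 3) should give {10, 20, 20, 30}: element 1 lands in position 2, element 2 lands in position 3, and element 1 is no longer copied twice. Because the offsets never decrease, the destinations i + offset[i] are all distinct, so the writes to out do not collide either.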

Upvotes: 3
