inconvergent

Reputation: 267

Atomic Saxpy in CUDA

I have the following problem in CUDA.

Let's assume we have a list of indices in which some, or all, of the indices may appear more than once:

inds = [1, 1, 1, 2, 2, 3, 4]

With these indices I would like to perform an atomic saxpy operation (in parallel) on a float array, x. I'm not worried about the order in which the operations are applied. That is, I want to do this, for floats a and k:

x[i] = x[i]*a + k;

This would be trivial if there were no duplicate indices in inds.
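For context, here is a minimal sketch of that trivial non-atomic version, assuming one thread per entry of inds (the kernel name and launch layout are just for illustration):

__global__ void saxpyScatter(float *x, const int *inds, int n,
                             const float a, const float k)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    int ind = inds[i];
    x[ind] = x[ind] * a + k;  // plain read-modify-write, no atomics
  }
}

With duplicate indices, several threads read and modify the same x[ind] concurrently, so some updates get lost; that is what the atomic version below is meant to prevent.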

My current solution (that does not work) is this:

// assume all values in adr are greater than or equal to 0.
// also assume a and k are strictly positive.
// -1.0f can therefore never occur as a real value and is used
// below as a lock/sentinel value.

__device__ inline void atomicSaxpy(float *adr,
                                   const float a,
                                   const float k){

  float old = atomicExch(adr, -1.0f); // first exchange
  float new_;
  if (old <= -1.0f){
    new_ = -1.0f;
  } else {
    new_ = old*a + k;
  }

  while (true) {
    old = atomicExch(adr, new_); // second exchange
    if (old <= -1.0f){
      break;
    }
    new_ = old*a + k;
  }
}

This seems to return the correct answer in many cases.

Here is what I think happens when you do not get the right answer:

  1. old gets a value of -1.0f in the first exchange. => new_ = -1.0f
  2. old gets a value of -1.0f in the second exchange as well.
  3. The function exits without having had any external effect at all.

A somewhat different approach is this:

__device__ inline void atomicSaxpy(float *adr,
                                   const float ia,
                                   const float k){

  float val;

  while (true) {
    val = atomicExch(adr, -1.0f);
    if (val > 1.0f){
      break;
    }
    atomicExch(adr, val*ia + k);
  }
}

This consistently deadlocks on my machine, even for very simple inputs such as the example data above.

Is it possible to rewrite this function to behave properly?

Example Answer

Assuming k=0.1 and a=0.95, and with the initial value of x as 0.5 for all indices, the result should be:

[0.5, 0.7139374999999998, 
 0.6462499999999999, 0.575, 0.575, ...]

I calculated these values using Python; they will probably look slightly different in CUDA. This is an example of how the algorithm should behave, not a good sample set for reproducing the race condition.
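For reference, the same numbers can be reproduced with a small sequential host-side check in plain C++ (the array sizes and names are just for this example):

#include <cstdio>

int main() {
  const int inds[] = {1, 1, 1, 2, 2, 3, 4};
  const float a = 0.95f, k = 0.1f;
  float x[5] = {0.5f, 0.5f, 0.5f, 0.5f, 0.5f};

  // Apply x[i] = x[i]*a + k once per occurrence of i in inds.
  for (int ind : inds)
    x[ind] = x[ind] * a + k;

  for (float v : x)
    printf("%f ", v);  // ~0.5 0.713938 0.646250 0.575000 0.575000
  printf("\n");
  return 0;
}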

Reference

Here is a thread where they implement atomicAdd (which already exists for floats at this point) using atomicExch:

https://devtalk.nvidia.com/default/topic/458062/atomicadd-float-float-atomicmul-float-float-/

An example looks like this:

__device__ inline void atomicAdd(float* address, float value) {
  float old = value;
  float new_old;

  do {
    // Swap 0.0f into the slot and add whatever was there to our pending sum.
    new_old = atomicExch(address, 0.0f);
    new_old += old;
  }
  // Try to write the sum back; if another thread deposited a nonzero value
  // in the meantime, take it as our new pending contribution and repeat.
  while ((old = atomicExch(address, new_old)) != 0.0f);
}

This seems a little simpler, but I can't quite see how to adapt it to the saxpy case.

Other Solutions

Solving the problem this way has several advantages related to memory I/O further down the road, so I would like to know whether it is possible at all.

A different approach would be to count the number of times each index occurs on the CPU and then perform a "regular" saxpy on the GPU, as sketched below. I assume there are other possibilities as well, but I'm still interested in an answer to this particular problem.
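A minimal sketch of that fallback, assuming the occurrence counts are built on the host first (the kernel name and the counts array are my own illustration):

// counts[i] = number of times index i occurs in inds (computed on the CPU).
// One thread per element of x applies the update counts[i] times.
__global__ void saxpyCounted(float *x, const int *counts, int n,
                             const float a, const float k)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    float v = x[i];
    for (int c = 0; c < counts[i]; ++c)
      v = v * a + k;  // same update, applied once per duplicate
    x[i] = v;
  }
}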

Upvotes: 3

Views: 276

Answers (1)

rompetroll

Reputation: 4799

If this were a non-parallel problem, you would simply do this:

*adr = *adr * a + k;

Since there are multiple threads operating on adr, however, we need to read and write with atomic operations:

float adrValue = atomicExch(adr, -1.0f);
float newValue = adrValue * a + k;
atomicExch(adr, newValue);

However, we must be aware that another thread may have updated adr between our reading step (line 1) and our writing step (line 3).

So our three-step operation, as it stands, is not atomic.

To make it atomic, we can use compare-and-swap (atomicCAS) to ensure we only update the memory if its value is unchanged since we read it. We simply repeat our steps, each iteration using the then-current value in adr as calculation input, until step 3 returns the expected lock value -1.0f.

float adrValue;

do {
    adrValue = atomicExch(adr, -1.0f);
    float newValue = adrValue * a + k;
    adrValue = __int_as_float(atomicCAS((int *)adr,
                                        __float_as_int(-1.0f),
                                        __float_as_int(newValue)));
} while (adrValue != -1.0f);

PS: treat the above as pseudocode.
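For a compilable variant of the same idea, a common pattern (the one the CUDA documentation uses for its own atomicAdd examples) is a plain compare-and-swap retry loop; the sketch below is my take on it for the saxpy update, with no sentinel value needed:

__device__ inline void atomicSaxpy(float *adr, const float a, const float k)
{
    int *adr_as_int = (int *)adr;
    int old = *adr_as_int;        // snapshot of the current bit pattern
    int assumed;
    do {
        assumed = old;
        float updated = __int_as_float(assumed) * a + k;
        // Commit only if *adr still holds the value the update was based on;
        // otherwise atomicCAS returns the newer value and we retry with it.
        old = atomicCAS(adr_as_int, assumed, __float_as_int(updated));
    } while (assumed != old);
}

Because no lock value is reserved, this also works when x can legitimately hold -1.0f or other negative values.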

Upvotes: 1
