inconvergent

Reputation: 267

Atomic Saxpy in CUDA

I have the following problem in CUDA.

Let's assume we have a list of indices in which some, or all, of the indices may appear more than once:

inds = [1, 1, 1, 2, 2, 3, 4]

With these indices I would like to perform an atomic saxpy operation (in parallel) on a float array, x. I'm not worried about the order in which the operations are applied. That is, I want to do this, for floats a and k:

x[i] = x[i]*a + k;

This would be trivial if there were no duplicate indices in inds.
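For context, here is a minimal sketch of that trivial non-atomic version, assuming one thread per entry of inds (the kernel name and launch layout are just for illustration):

__global__ void saxpyScatter(float *x, const int *inds, int n,
                             const float a, const float k)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    int ind = inds[i];
    x[ind] = x[ind] * a + k;  // plain read-modify-write, no atomics
  }
}

With duplicate indices, several threads read and modify the same x[ind] concurrently, so some updates get lost; that is what the atomic version below is meant to prevent.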

My current solution (that does not work) is this:

// assume all values in adr are greater than or equal to 0.
// also assume a and k are strictly positive.
// -1.0f can therefore never occur as a real value and is used
// below as a lock/sentinel value.

__device__ inline void atomicSaxpy(float *adr,
                                   const float a,
                                   const float k){

  float old = atomicExch(adr, -1.0f); // first exchange
  float new_;
  if (old <= -1.0f){
    new_ = -1.0f;
  } else {
    new_ = old*a + k;
  }

  while (true) {
    old = atomicExch(adr, new_); // second exchange
    if (old <= -1.0f){
      break;
    }
    new_ = old*a + k;
  }
}

This seems to return the correct answer in many cases.

Here is what I think happens when you do not get the right answer:

  1. old gets a value of -1.0f in the first exchange. => new_ = -1.0f
  2. old gets a value of -1.0f in the second exchange as well.
  3. The function exits without having had any external effect at all.

A somewhat different approach is this:

__device__ inline void atomicSaxpy(float *adr,
                                   const float ia,
                                   const float k){

  float val;

  while (true) {
    val = atomicExch(adr, -1.0f);
    if (val > 1.0f){
      break;
    }
    atomicExch(adr, val*ia + k);
  }
}

This consistently deadlocks on my machine, even for very simple inputs such as the example data above.

Is it possible to rewrite this function to behave properly?

Example Answer

Assuming k=0.1 and a=0.95, and with the initial value of x as 0.5 for all indices, the result should be:

[0.5, 0.7139374999999998, 
 0.6462499999999999, 0.575, 0.575, ...]

I calculated these values using Python; they will probably look slightly different in CUDA. This is an example of how the algorithm should behave, not a good sample set for reproducing the race condition.
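For reference, the same numbers can be reproduced with a small sequential host-side check in plain C++ (the array sizes and names are just for this example):

#include <cstdio>

int main() {
  const int inds[] = {1, 1, 1, 2, 2, 3, 4};
  const float a = 0.95f, k = 0.1f;
  float x[5] = {0.5f, 0.5f, 0.5f, 0.5f, 0.5f};

  // Apply x[i] = x[i]*a + k once per occurrence of i in inds.
  for (int ind : inds)
    x[ind] = x[ind] * a + k;

  for (float v : x)
    printf("%f ", v);  // ~0.5 0.713938 0.646250 0.575000 0.575000
  printf("\n");
  return 0;
}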

Reference

Here is a thread where they implement atomicAdd (which already exists for floats at this point) using atomicExch:

https://devtalk.nvidia.com/default/topic/458062/atomicadd-float-float-atomicmul-float-float-/

An example looks like this:

__device__ inline void atomicAdd(float* address, float value) {
  float old = value;
  float new_old;

  do {
    // Swap 0.0f into the slot and add whatever was there to our pending sum.
    new_old = atomicExch(address, 0.0f);
    new_old += old;
  }
  // Try to write the sum back; if another thread deposited a nonzero value
  // in the meantime, take it as our new pending contribution and repeat.
  while ((old = atomicExch(address, new_old)) != 0.0f);
}

This seems a little simpler, but I can't quite see how to adapt it to the saxpy case.

Other Solutions

Solving the problem this way has several advantages related to memory I/O further down the road, so I would like to know whether it is possible at all.

A different approach would be to count the number of times each index occurs on the CPU and then perform a "regular" saxpy on the GPU, as sketched below. I assume there are other possibilities as well, but I'm still interested in an answer to this particular problem.
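A minimal sketch of that fallback, assuming the occurrence counts are built on the host first (the kernel name and the counts array are my own illustration):

// counts[i] = number of times index i occurs in inds (computed on the CPU).
// One thread per element of x applies the update counts[i] times.
__global__ void saxpyCounted(float *x, const int *counts, int n,
                             const float a, const float k)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    float v = x[i];
    for (int c = 0; c < counts[i]; ++c)
      v = v * a + k;  // same update, applied once per duplicate
    x[i] = v;
  }
}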

Upvotes: 3

Views: 276

Answers (1)

rompetroll

Reputation: 4799

If this were a non-parallel problem, you would simply do this:

*adr = *adr * a + k;

Since there are multiple threads operating on adr, however, we need to read and write with atomic operations:

float adrValue = atomicExch(adr, -1.0f);
float newValue = adrValue * a + k;
atomicExch(adr, newValue);

However, we must be aware that another thread may have updated adr between our reading step (line 1) and our writing step (line 3).

So our three-step operation, as it stands, is not atomic.

To make it atomic, we can use compare-and-swap (atomicCAS) to ensure we only update the memory if its value is unchanged since we read it. We simply repeat our steps, each iteration using the then-current value in adr as calculation input, until step 3 returns the expected lock value -1.0f.

float adrValue;

do {
    adrValue = atomicExch(adr, -1.0f);
    float newValue = adrValue * a + k;
    adrValue = __int_as_float(atomicCAS((int *)adr,
                                        __float_as_int(-1.0f),
                                        __float_as_int(newValue)));
} while (adrValue != -1.0f);

PS: treat the above as pseudocode.
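For a compilable variant of the same idea, a common pattern (the one the CUDA documentation uses for its own atomicAdd examples) is a plain compare-and-swap retry loop; the sketch below is my take on it for the saxpy update, with no sentinel value needed:

__device__ inline void atomicSaxpy(float *adr, const float a, const float k)
{
    int *adr_as_int = (int *)adr;
    int old = *adr_as_int;        // snapshot of the current bit pattern
    int assumed;
    do {
        assumed = old;
        float updated = __int_as_float(assumed) * a + k;
        // Commit only if *adr still holds the value the update was based on;
        // otherwise atomicCAS returns the newer value and we retry with it.
        old = atomicCAS(adr_as_int, assumed, __float_as_int(updated));
    } while (assumed != old);
}

Because no lock value is reserved, this also works when x can legitimately hold -1.0f or other negative values.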

Upvotes: 1
