kirill
kirill

Reputation: 61

How to have atomic load in CUDA

My question is how I can have atomic load in CUDA. Atomic exchange can emulate atomic store. Can atomic load be emulated non-expensively in a similar manner? I can use an atomic add with 0 to load the content atomically but I think it is expensive because it does an atomic read-modify-write instead of only a read.

Upvotes: 6

Views: 2783

Answers (2)

Allison Piper
Allison Piper

Reputation: 81

In addition to using volatile as recommended in the other answer, using __threadfence appropriately is also required to get an atomic load with safe memory ordering.

While some of the comments are saying to just use a normal read because it cannot tear, that is not the same as an atomic load. There's more to atomics than just tearing:

A normal read may reuse a previous load that's already in a register, and thus may not reflect changes made by other SMs with the desired memory ordering. For instance, int *flag = ...; while (*flag) { ... } may only read flag once and reuse this value for every iteration of the loop. If you're waiting for another thread to change the flag's value, you'll never observe the change. The volatile modifier ensures that the value is actually read from memory on every access. See the CUDA documentation on volatile for more info.

Additionally, you'll need to use a memory fence to enforce the correct memory ordering in the calling thread. Without a fence, you get "relaxed" semantics in C++11 parlance, and this can be unsafe when using an atomic for communication.

For example, say your code (non-atomically) writes some large data to memory and then uses a normal write to set an atomic flag to indicate that the data has been written. The instructions may be reordered, hardware cachelines may not be flushed prior to setting the flag, etc etc. The result is that these operations are not guaranteed to be executed in any order, and other threads may not observe these events in the order you expect: The write to the flag is permitted to be happen before the guarded data is written.

Meanwhile, if the reading thread is also using normal reads to check the flag before conditionally loading the data, there will be a race at the hardware level. Out-of-order and/or speculative execution may load the data before the flag's read is completed. The speculatively loaded data is then used, which may not be valid since it was loaded prior to the flag's read.

Well-placed memory fences prevent these sorts of issues by enforcing that instruction reordering will not affect your desired memory ordering and that previous writes are made visible to other threads. __threadfence() and friends are also covered in the CUDA docs.

Putting all of this together, writing your own atomic load method in CUDA looks something like:

// addr must be aligned properly.
__device__ unsigned int atomicLoad(const unsigned int *addr)
{
  const volatile unsigned int *vaddr = addr; // volatile to bypass cache
  __threadfence(); // for seq_cst loads. Remove for acquire semantics.
  const unsigned int value = *vaddr;
  // fence to ensure that dependent reads are correctly ordered
  __threadfence(); 
  return value; 
}

// addr must be aligned properly.
__device__ void atomicStore(unsigned int *addr, unsigned int value)
{
  volatile unsigned int *vaddr = addr; // volatile to bypass cache
  // fence to ensure that previous non-atomic stores are visible to other threads
  __threadfence(); 
  *vaddr = value;
}

This can be written similarly for other non-tearing load/store sizes.

From talking with some NVIDIA devs who work on CUDA atomics, it looks like we should start seeing better support for atomics in CUDA, and the PTX already contains load/store instructions with acquire/release memory ordering semantics -- but there is no way to access them currently without resorting to inline PTX. They're hoping to add them in sometime this year. Once those are in place, a full std::atomic implementation shouldn't be far behind.

Upvotes: 4

user703016
user703016

Reputation: 37995

To the best of my knowledge, there is currently no way of requesting an atomic load in CUDA, and that would be a great feature to have.

There are two quasi-alternatives, with their advantages and drawbacks:

  1. Use a no-op atomic read-modify-write as you suggest. I have provided a similar answer in the past. Guaranteed atomicity and memory consistency but you pay the cost of a needless write.

  2. In practice, the second closest thing to an atomic load could be marking a variable volatile, although strictly speaking the semantics are completely different. The language does not guarantee atomicity of the load (for example, you may in theory get a torn read), but you are guaranteed to get the most up-to-date value. But in practice, as indicated in the comments by @Robert Crovella, it is impossible to get a torn read for properly-aligned transactions of at most 32 bytes, which does make them atomic.

Solution 2 is kind of hacky and I do not recommend it, but it is currently the only write-less alternative to 1. The ideal solution would be to add a way to express atomic loads directly in the language.

Upvotes: 2

Related Questions