WaltK

Reputation: 774

If a variable is read by many threads but written in only one, should it be read non-atomically in the writing thread for best performance?

Specifically, should I use this template to read the variable in the writer thread for optimal performance?

template <typename T>
inline T load_non_atomic(const std::atomic<T> &v)
  {
    if (sizeof(std::atomic<T>) == sizeof(T))
      return(* reinterpret_cast<const T *>(&v));
    else
      return(v.load(std::memory_order_relaxed));
  }

Upvotes: 2

Views: 120

Answers (2)

BeeOnRope

Reputation: 65046

You can't portably or legally just cast the std::atomic<T> to a T object through a reinterpret_cast on a pointer like you are doing, although you will often find that it works in practice.

Other than the UB, the primary downside is that the compiler won't necessarily reload the value each time this method is called, which is probably something you want. You may find that it simply caches the value, breaking assumptions your underlying algorithm makes (e.g., if you're checking a flag in a loop, the value may never be seen to change).

In practice, v.load(std::memory_order_relaxed) is going to generate fast code on most platforms anyway.

For example, the following code to read two std::atomic<int> compiles almost as well with a plain .load() as with your hack:

template <typename T>
inline T load_cheating(const std::atomic<T> &v) {
  return (* reinterpret_cast<const T *>(&v));
}

template <typename T>
inline T load_relaxed(const std::atomic<T> &v) {
  return (v.load(std::memory_order_relaxed));
}

int add_two_cheating(const std::atomic<int> &a, const std::atomic<int> &b) {
  return load_cheating(a) + load_cheating(b);
}

int add_two_relaxed(const std::atomic<int> &a, const std::atomic<int> &b) {
  return load_relaxed(a) + load_relaxed(b);
}

The two versions end up as:

add_two_cheating(std::atomic<int> const&, std::atomic<int> const&):
        mov     eax, DWORD PTR [rsi]
        add     eax, DWORD PTR [rdi]
        ret

and

add_two_relaxed(std::atomic<int> const&, std::atomic<int> const&):
        mov     edx, DWORD PTR [rdi]
        mov     eax, DWORD PTR [rsi]
        add     eax, edx
        ret

These have essentially identical performance¹. Perhaps one day the latter will be literally identical, although for most practical purposes it already is.

Even on ARM, which has a weaker memory model, you pay zero performance cost:

add_two_cheating(std::atomic<int> const&, std::atomic<int> const&):
        ldr     w2, [x0]
        ldr     w0, [x1]
        add     w0, w2, w0
        ret

add_two_relaxed(std::atomic<int> const&, std::atomic<int> const&):
        ldr     w0, [x0]
        ldr     w1, [x1]
        add     w0, w1, w0
        ret

Identical code is produced in both cases (the more-or-less-RISC ARM architecture doesn't have load-op instructions, so you don't see the slight difference you did on x86).

Note that once you use a type-punned pointer to read or modify the variable, even single-threaded code can be broken (e.g., reads may ignore earlier writes, or, in some cases, reads can see writes that happen in the future on the same thread).

Check out the triple_nonatomic examples on godbolt - they all get the single-threaded behavior wrong. I couldn't easily make it happen with an intervening std::atomic::store() type operation, probably because those aren't as heavily optimized today (even relaxed ordering seems to imply a compiler barrier) - but they certainly may be in the future.


¹ On modern x86, it's the same number of ops in the unfused domain, and likely the same latency, but the first version does have one fewer uop in the fused domain. We are talking about a fraction of a cycle difference on average, if any.

Upvotes: 3

Yakk - Adam Nevraumont

Reputation: 275878

No, what you describe is undefined behavior.

A decent optimizer will reduce an atomic read to a plain read if doing so is defined behavior. You may not have a decent optimizer, or your defined-behavior code may be asking a stricter question than you actually need.


If you do this, you are now responsible for auditing the generated assembly and machine code, and the CPU and memory architecture, in every future compile of your code, across OS revisions, compiler version updates, hardware changes, etc.

So if your code is going to be compiled once, run once, then thrown away, what you did is merely a ridiculous amount of effort.

If it is going to have a longer lifetime, what you are doing requires a nearly immeasurable amount of effort to avoid random breaks in the code base at some future date.

Doing this without strong evidence that it generates faster code (which is not in evidence), that the faster code is correct, and that the speed increase is critical to your problem, would quite simply be stupid.

Upvotes: 2
