Reputation: 774
Specifically, should I use this template to read the variable in the writer thread for optimal performance?
template <typename T>
inline T load_non_atomic(const std::atomic<T> &v)
{
    if (sizeof(std::atomic<T>) == sizeof(T))
        return *reinterpret_cast<const T *>(&v);
    else
        return v.load(std::memory_order_relaxed);
}
Upvotes: 2
Views: 120
Reputation: 65046
You can't portably or legally just cast the std::atomic&lt;T&gt; to a T object through a reinterpret_cast on a pointer like you are doing, although you will often find it works in practice.
Other than the UB, the primary downside is that the compiler won't necessarily reload the value each time this method is called, which is probably something you want. You may find that it simply caches the value, breaking assumptions your underlying algorithm makes (e.g., if you are checking a flag in a loop, the value may never be seen to change).
In practice, v.load(std::memory_order_relaxed) is going to generate fast code on most platforms anyway. For example, the following code to read two std::atomic&lt;int&gt; objects compiles almost as well with a plain .load() as with your hack:
template <typename T>
inline T load_cheating(const std::atomic<T> &v) {
    return *reinterpret_cast<const T *>(&v);
}
template <typename T>
inline T load_relaxed(const std::atomic<T> &v) {
    return v.load(std::memory_order_relaxed);
}
int add_two_cheating(const std::atomic<int> &a, const std::atomic<int> &b) {
    return load_cheating(a) + load_cheating(b);
}

int add_two_relaxed(const std::atomic<int> &a, const std::atomic<int> &b) {
    return load_relaxed(a) + load_relaxed(b);
}
The two versions end up as:
add_two_cheating(std::atomic<int> const&, std::atomic<int> const&):
        mov     eax, DWORD PTR [rsi]
        add     eax, DWORD PTR [rdi]
        ret
and
add_two_relaxed(std::atomic<int> const&, std::atomic<int> const&):
        mov     edx, DWORD PTR [rdi]
        mov     eax, DWORD PTR [rsi]
        add     eax, edx
        ret
These have essentially identical performance¹. Perhaps one day the latter will be literally identical, although for most practical purposes it already is.
Even on ARM, which has a weaker memory model, you pay zero performance cost:
add_two_cheating(std::atomic<int> const&, std::atomic<int> const&):
        ldr     w2, [x0]
        ldr     w0, [x1]
        add     w0, w2, w0
        ret
add_two_relaxed(std::atomic<int> const&, std::atomic<int> const&):
        ldr     w0, [x0]
        ldr     w1, [x1]
        add     w0, w1, w0
        ret
Identical code is produced in both cases (the more-or-less-RISC ARM architecture doesn't have load-op instructions, so you don't see the slight difference you did on x86).
Note that once you use a type-punned pointer to read or modify the variable, even single-threaded code can be broken (e.g., reads may ignore earlier writes, or, in some cases, reads can see writes that _happen in the future_ on the same thread).
Check out the triple_nonatomic examples on godbolt - they all get the single-threaded behavior wrong. I couldn't easily make it happen with an intervening std::atomic.store()-type operation, probably because those aren't as aggressively optimized today (even relaxed ordering seems to imply a compiler barrier) - but they certainly may be in the future.
¹ On modern x86, it's the same number of uops in the unfused domain, and likely the same latency, but the first one does have one fewer uop in the fused domain. We are talking about a fraction of a cycle difference on average, if any.
Upvotes: 3
Reputation: 275878
No, what you describe is undefined behavior.
A decent optimizer will reduce an atomic read to a plain read whenever doing so is defined behavior. You may not have a decent optimizer, or maybe your defined-behavior code is asking a stricter question than you actually need.
If you do this, you are now responsible for auditing the generated machine code, the CPU, and the memory architecture in every future build of your code: across OS revisions, compiler version updates, hardware changes, etc.
So if your code is going to be compiled once, run once, then thrown away, what you did is merely a ridiculous amount of effort.
If it is going to have a longer lifetime, what you are doing requires a nearly immeasurable amount of effort to avoid random breaks in the code base at some future date.
Doing this without strong evidence that it generates faster code (which is not in evidence), that the faster code is correct, and that the speed increase is critical to your problem would, quite simply, be stupid.
Upvotes: 2