Reputation: 4183
The code below shows two ways of acquiring shared state via an atomic flag. The reader thread calls poll1() or poll2() to check whether the writer has signaled the flag.
Poll Option #1:
bool poll1() {
    return (flag.load(std::memory_order_acquire) == 1);
}

Poll Option #2:

bool poll2() {
    int snapshot = flag.load(std::memory_order_relaxed);
    if (snapshot == 1) {
        std::atomic_thread_fence(std::memory_order_acquire);
        return true;
    }
    return false;
}
Note that option #1 was presented in an earlier question, and option #2 is similar to example code at cppreference.com.
Assuming that the reader agrees to only examine the shared state if the poll function returns true, are the two poll functions both correct and equivalent?
Does option #2 have a standard name?
What are the benefits and drawbacks of each option?
Is option #2 likely to be more efficient in practice? Is it possible for it to be less efficient?
Here is a full working example:
#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>

int x; // regular variable, could be a complex data structure
std::atomic<int> flag { 0 };

void writer_thread() {
    x = 42;
    // release value x to reader thread
    flag.store(1, std::memory_order_release);
}

bool poll1() {
    return (flag.load(std::memory_order_acquire) == 1);
}

bool poll2() {
    int snapshot = flag.load(std::memory_order_relaxed);
    if (snapshot == 1) {
        std::atomic_thread_fence(std::memory_order_acquire);
        return true;
    }
    return false;
}

int main() {
    x = 0;
    std::thread t(writer_thread);
    // "reader thread" ...
    // sleep-wait is just for the test;
    // production code calls poll() at specific points
    while (!poll2()) // poll1() or poll2() here
        std::this_thread::sleep_for(std::chrono::milliseconds(50));
    std::cout << x << std::endl;
    t.join();
}
Upvotes: 4
Views: 298
Reputation: 98816
I think I can answer most of your questions.
Both options are certainly correct, but they are not quite equivalent, because a stand-alone fence has slightly broader applicability. They are equivalent in terms of what you want to accomplish here, but the stand-alone fence could technically apply to other operations as well (imagine this code being inlined). Jeff Preshing explains how a stand-alone fence differs from a fence attached to a store/fetch in this post.
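To illustrate that broader applicability, here is a hypothetical sketch (the two flags and the function name are mine, not from the question): a single stand-alone acquire fence can pair with several preceding relaxed loads at once, whereas the option #1 style would need an acquire on each load.

```cpp
#include <atomic>

// Hypothetical example: two independent flags, each published by a
// writer with a release store. Names are illustrative only.
std::atomic<int> flagA{0}, flagB{0};

bool poll_both() {
    // Two relaxed loads share the single stand-alone acquire fence
    // below; with option #1 each load would carry its own acquire.
    int a = flagA.load(std::memory_order_relaxed);
    int b = flagB.load(std::memory_order_relaxed);
    if (a == 1 && b == 1) {
        std::atomic_thread_fence(std::memory_order_acquire);
        return true; // data released before *both* stores is now visible
    }
    return false;
}
```

This is exactly the kind of effect that could appear after inlining: one fence ends up ordering loads that were written as separate polls.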
The check-then-fence pattern in option #2 does not have a name as far as I know. It's not uncommon, though.
In terms of performance, with my g++ 4.8.1 on x64 (Linux) the assembly generated by both options boils down to a single load instruction. This is hardly surprising given that x86(-64) loads and stores all have acquire and release semantics at the hardware level anyway (x86 is known for its quite strong memory model).
For ARM, though, where memory barriers compile down to actual individual instructions, the following output is produced (using gcc.godbolt.com with -O3 -DNDEBUG):
For while (!poll1());:
.L25:
    ldr r0, [r2]
    movw r3, #:lower16:.LANCHOR0
    dmb sy
    movt r3, #:upper16:.LANCHOR0
    cmp r0, #1
    bne .L25
For while (!poll2());:
.L29:
    ldr r0, [r2]
    movw r3, #:lower16:.LANCHOR0
    movt r3, #:upper16:.LANCHOR0
    cmp r0, #1
    bne .L29
    dmb sy
You can see that the only difference is where the synchronization instruction (dmb) is placed -- inside the loop for poll1, and after it for poll2. So poll2 really is more efficient in this real-world case :-) (But read further on for why this might not matter if they're called in a loop to block until the flag changes.)
For ARM64, the output is different, because there are special load/store instructions that have barriers built in (ldar -> load-acquire).
For while (!poll1());:
.L16:
    ldar w0, [x1]
    cmp w0, 1
    bne .L16
For while (!poll2());:
.L24:
    ldr w0, [x1]
    cmp w0, 1
    bne .L24
    dmb ishld
Again, poll2 leads to a loop with no barriers within it, and one outside, whereas poll1 does a barrier each time through.
Now, which one is actually more performant requires running a benchmark, and unfortunately I don't have the setup for that. Counter-intuitively, poll1 and poll2 may end up being equally efficient in this case: extra time spent waiting for memory effects to propagate inside the loop may not actually be wasted if the flag variable is one of the effects that needs to propagate anyway (i.e. the total time taken until the loop exits may be the same even if individual (inlined) calls to poll1 take longer than those to poll2). Of course, this assumes a loop waiting for the flag to change -- individual calls to poll1 do require more work than individual calls to poll2.
So, I think overall it's fairly safe to say that poll2 should never be significantly less efficient than poll1 and can often be faster, as long as the compiler can eliminate the branch when it's inlined (which seems to be the case for at least these three popular architectures).
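If you do have hardware to test on, a minimal timing harness might look like the sketch below (my assumptions: steady_clock has adequate resolution, and one spin-until-signaled round trip is a usable proxy -- a serious benchmark would repeat many times and pin threads to cores):

```cpp
#include <atomic>
#include <chrono>
#include <thread>

std::atomic<int> flag{0};
int payload = 0;

bool poll1() { return flag.load(std::memory_order_acquire) == 1; }

bool poll2() {
    if (flag.load(std::memory_order_relaxed) == 1) {
        std::atomic_thread_fence(std::memory_order_acquire);
        return true;
    }
    return false;
}

// Time one spin-wait round trip using the given poll function.
template <typename Poll>
long long time_spin(Poll poll) {
    flag.store(0, std::memory_order_relaxed);
    std::thread writer([] {
        payload = 42;
        flag.store(1, std::memory_order_release);
    });
    auto start = std::chrono::steady_clock::now();
    while (!poll()) {} // spin until the writer signals
    auto stop = std::chrono::steady_clock::now();
    writer.join();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(
        stop - start).count();
}
```

Comparing time_spin(poll1) against time_spin(poll2) over many runs is the only way to settle the question for a given machine; a difference that doesn't survive repetition is noise.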
My (slightly different) test code for reference:
#include <atomic>
#include <thread>
#include <cstdio>

int sharedState;
std::atomic<int> flag(0);

bool poll1() {
    return (flag.load(std::memory_order_acquire) == 1);
}

bool poll2() {
    int snapshot = flag.load(std::memory_order_relaxed);
    if (snapshot == 1) {
        std::atomic_thread_fence(std::memory_order_acquire);
        return true;
    }
    return false;
}

void __attribute__((noinline)) threadFunc()
{
    while (!poll2());
    std::printf("%d\n", sharedState);
}

int main(int argc, char** argv)
{
    std::thread t(threadFunc);
    sharedState = argc;
    flag.store(1, std::memory_order_release);
    t.join();
    return 0;
}
Upvotes: 2