Ross Bencina

Reputation: 4183

Benefits & drawbacks of as-needed conditional std::atomic_thread_fence acquire?

The code below shows two ways of acquiring shared state via an atomic flag. The reader thread calls poll1() or poll2() to check whether the writer has signaled the flag.

Poll Option #1:

bool poll1() {
    return (flag.load(std::memory_order_acquire) == 1);
}

Poll Option #2:

bool poll2() {
    int snapshot = flag.load(std::memory_order_relaxed);
    if (snapshot == 1) {
        std::atomic_thread_fence(std::memory_order_acquire);
        return true;
    }
    return false;
}

Note that option #1 was presented in an earlier question, and option #2 is similar to example code at cppreference.com.

Assuming that the reader agrees to only examine the shared state if the poll function returns true, are the two poll functions both correct and equivalent?

Does option #2 have a standard name?

What are the benefits and drawbacks of each option?

Is option #2 likely to be more efficient in practice? Is it possible for it to be less efficient?

Here is a full working example:

#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>

int x; // regular variable, could be a complex data structure

std::atomic<int> flag { 0 };

void writer_thread() {
    x = 42;
    // release value x to reader thread
    flag.store(1, std::memory_order_release);
}

bool poll1() {
    return (flag.load(std::memory_order_acquire) == 1);
}

bool poll2() {
    int snapshot = flag.load(std::memory_order_relaxed);
    if (snapshot == 1) {
        std::atomic_thread_fence(std::memory_order_acquire);
        return true;
    }
    return false;
}

int main() {
    x = 0;

    std::thread t(writer_thread);

    // "reader thread" ...  
    // sleep-wait is just for the test.
    // production code calls poll() at specific points

    while (!poll2()) // poll1() or poll2() here
      std::this_thread::sleep_for(std::chrono::milliseconds(50));

    std::cout << x << std::endl;

    t.join();
}

Upvotes: 4

Views: 298

Answers (1)

Cameron

Reputation: 98816

I think I can answer most of your questions.

Both options are certainly correct, but they are not quite equivalent, because a stand-alone fence has slightly broader applicability: the two are equivalent in terms of what you want to accomplish here, but the stand-alone fence could technically apply to other memory operations as well (imagine this code being inlined into a larger function). An example of how a stand-alone fence differs from a fence attached to a specific load or store is explained in this post by Jeff Preshing.
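As a concrete illustration of that broader applicability (my own sketch, not from Preshing's post): a single stand-alone acquire fence can pair with release stores to several different atomic variables at once, as long as the relaxed loads that observed those stores are sequenced before the fence. An acquire load, by contrast, only orders against the one variable it reads.

#include <atomic>

std::atomic<int> flagA { 0 }, flagB { 0 };
int dataA, dataB; // each published by its own writer via a release store to its flag

bool poll_both() {
    int a = flagA.load(std::memory_order_relaxed);
    int b = flagB.load(std::memory_order_relaxed);
    if (a == 1 && b == 1) {
        // This one fence synchronizes-with the release stores to both
        // flagA and flagB, making dataA and dataB visible to this thread.
        std::atomic_thread_fence(std::memory_order_acquire);
        return true;
    }
    return false;
}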

The check-then-fence pattern in option #2 does not have a name as far as I know. It's not uncommon, though.

In terms of performance, with my g++ 4.8.1 on x64 (Linux), the assembly generated for both options boils down to a single load instruction. This is hardly surprising, given that x86(-64) loads already have acquire semantics and stores already have release semantics at the hardware level (x86 is known for its quite strong memory model).

For ARM, though, where memory barriers compile down to actual individual instructions, the following output is produced (using gcc.godbolt.com with -O3 -DNDEBUG):

For while (!poll1());:

.L25:
    ldr     r0, [r2]
    movw    r3, #:lower16:.LANCHOR0
    dmb     sy
    movt    r3, #:upper16:.LANCHOR0
    cmp     r0, #1
    bne     .L25

For while (!poll2());:

.L29:
    ldr     r0, [r2]
    movw    r3, #:lower16:.LANCHOR0
    movt    r3, #:upper16:.LANCHOR0
    cmp     r0, #1
    bne     .L29
    dmb     sy

You can see that the only difference is where the synchronization instruction (dmb) is placed -- inside the loop for poll1, and after it for poll2. So poll2 really is more efficient in this real-world case :-) (But read further on for why this might not matter if they're called in a loop to block until the flag changes.)

For ARM64, the output is different, because ARM64 has special load/store instructions with the barrier built in (ldar is a load-acquire).

For while (!poll1());:

.L16:
    ldar    w0, [x1]
    cmp     w0, 1
    bne     .L16

For while (!poll2());:

.L24:
    ldr     w0, [x1]
    cmp     w0, 1
    bne     .L24
    dmb     ishld

Again, poll2 leads to a loop with no barriers within it, and one outside, whereas poll1 does a barrier each time through.
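If you would rather not rely on the compiler hoisting the fence out of the inlined loop, the same shape can be written explicitly in source. A minimal sketch of a blocking wait (my own variant, equivalent in effect to looping on poll2):

#include <atomic>

extern std::atomic<int> flag; // the flag from the example above

void wait_for_flag() {
    // Spin on relaxed loads only -- no barrier per iteration ...
    while (flag.load(std::memory_order_relaxed) != 1) { }
    // ... then issue a single acquire fence once the flag has been seen.
    std::atomic_thread_fence(std::memory_order_acquire);
}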

Now, determining which one is actually more performant would require running a benchmark, and unfortunately I don't have the setup for that. Counter-intuitively, poll1 and poll2 may end up being equally efficient in this case: if the flag variable is one of the memory effects that has to propagate anyway, the extra time spent waiting on memory effects inside the loop may not actually be wasted, so the total time until the loop exits may be the same even though individual (inlined) calls to poll1 take longer than individual calls to poll2. Of course, this assumes a loop waiting for the flag to change -- individual calls to poll1 do require more work than individual calls to poll2.
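If you do want to measure it, here is a rough sketch of the kind of benchmark I have in mind (entirely my own construction; the names are hypothetical). The writer stamps the clock just before the release store and the reader stamps it after the polling loop exits, so the difference approximates the propagation latency. Conveniently, the timestamp itself is published by the very release/acquire pairing being tested, so reading it after the loop is race-free. A single run is noisy -- pin the threads and average many runs for meaningful numbers.

#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

std::atomic<int> flag { 0 };
std::chrono::steady_clock::time_point t_store; // written before the release store

bool poll2() {
    int snapshot = flag.load(std::memory_order_relaxed);
    if (snapshot == 1) {
        std::atomic_thread_fence(std::memory_order_acquire);
        return true;
    }
    return false;
}

int main() {
    std::thread reader([] {
        while (!poll2()) { } // swap in poll1() from above to compare
        auto seen = std::chrono::steady_clock::now();
        // Reading t_store here is safe: it was written before the release
        // store, and poll2's acquire fence makes that write visible.
        std::printf("flag observed after ~%lld ns\n",
                    (long long)std::chrono::duration_cast<std::chrono::nanoseconds>(
                        seen - t_store).count());
    });
    std::this_thread::sleep_for(std::chrono::milliseconds(10)); // let the reader start spinning
    t_store = std::chrono::steady_clock::now();
    flag.store(1, std::memory_order_release);
    reader.join();
}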

So, I think overall it's fairly safe to say that poll2 should never be significantly less efficient than poll1 and can often be faster, as long as the compiler can eliminate the branch when it's inlined (which seems to be the case for at least these three popular architectures).

My (slightly different) test code for reference:

#include <atomic>
#include <thread>
#include <cstdio>

int sharedState;
std::atomic<int> flag(0);

bool poll1() {
    return (flag.load(std::memory_order_acquire) == 1);
}

bool poll2() {
    int snapshot = flag.load(std::memory_order_relaxed);
    if (snapshot == 1) {
        std::atomic_thread_fence(std::memory_order_acquire);
        return true;
    }
    return false;
}

void __attribute__((noinline)) threadFunc()
{
    while (!poll2());
    std::printf("%d\n", sharedState);
}

int main(int argc, char** argv)
{
    std::thread t(threadFunc);
    sharedState = argc;
    flag.store(1, std::memory_order_release);
    t.join();
    return 0;
}

Upvotes: 2
