Sam S

Reputation: 101

Partial reordering of C++11 atomics on Aarch64

I was looking at the compiler output of rmw atomics from gcc and noticed something odd - on Aarch64, rmw operations such as fetch_add can be partially reordered with relaxed loads.

On Aarch64, the following code may be generated for value.fetch_add(1, seq_cst)

.L1:
    ldaxr x1, [x0]
    add x1, x1, 1
    stlxr w2, x1, [x0]
    cbnz w2, .L1

However, loads and stores that happen prior to the ldaxr can be reordered past it, and loads and stores that happen after the stlxr can be reordered before it (see here). GCC doesn't add fences to prevent this. Here's a small piece of code demonstrating it:

void partial_reorder(std::atomic<uint64_t>& loader, std::atomic<uint64_t>& adder) {
    loader.load(std::memory_order_relaxed); // can be reordered past the ldaxr
    adder.fetch_add(1, std::memory_order_seq_cst);
    loader.load(std::memory_order_relaxed); // can be reordered past the stlxr
}

generating

partial_reorder(std::atomic<int>, std::atomic<int>):
    ldr     w2, [x0] @ reordered down
.L2:
    ldaxr   w2, [x1]
    add     w2, w2, 1
    stlxr   w3, w2, [x1]
    cbnz    w3, .L2
    ldr     w0, [x0] @ reordered up
    ret

In effect, the loads can be partially reordered with the RMW operation - they occur in the middle of it.

So, what's the big deal? What am I asking?

  1. It seems strange that an atomic operation is divisible in this way. I couldn't find anything in the standard preventing it, but I had believed there was a combination of rules implying that such operations are indivisible.

  2. It seems like this doesn't respect acquire ordering. If I perform a load directly after this operation, I could see store-load or store-store reordering between the fetch_add and the later operation, meaning that the later memory access is at least partially reordered before the acquire operation. Again, I couldn't find anything in the standard explicitly saying this isn't allowed (and acquire is load ordering), but my understanding was that the acquire ordering applied to the entirety of the operation, not just parts of it. A similar scenario applies to release, where something is reordered past the ldaxr.

  3. This one may be stretching the ordering definitions a bit more, but it seems invalid that two operations before and after a seq_cst operation can be reordered past each other. This could(?) happen if the bordering operations each reorder into the middle of the RMW and then past each other.

Upvotes: 5

Views: 719

Answers (2)

Nate Eldredge

Reputation: 58132

I asked the same question in For purposes of ordering, is atomic read-modify-write one operation or two?, not knowing that it was a duplicate.

You're right that this means another load or store can be reordered "into the middle" of an atomic RMW. I don't think this is a bug, though.

Since nearly all of the C++ memory model is defined in terms of loads and stores, I believe (others may disagree) that we must treat an atomic read-modify-write as a pair consisting of one load and one store. Its "atomic" nature comes from [atomics.order p10] (in C++20), that the load must see the value that immediately precedes, in the modification order, the value written by the store.

Effectively, this means that no other accesses to loader itself can occur between the read and the write. But accesses to other variables are fair game, limited only by the barriers. Acquire ordering doesn't forbid the load of the RMW from being reordered with an earlier relaxed operation, so such reordering is legitimate.
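As a concrete illustration (not from the original answer), this is what that atomicity guarantee buys you: no increment is ever lost between the load and the store of a fetch_add, even though unrelated accesses may be scheduled between the corresponding ldaxr and stlxr at the instruction level:

```cpp
#include <atomic>
#include <cstdint>
#include <thread>

// Two threads hammer the same counter with relaxed fetch_adds. Because the
// RMW's load must read the value immediately preceding its own store in the
// modification order, every increment is observed: the final count is exact.
uint64_t count_with_two_threads(int iterations) {
    std::atomic<uint64_t> counter{0};
    auto work = [&counter, iterations] {
        for (int i = 0; i < iterations; ++i)
            counter.fetch_add(1, std::memory_order_relaxed);
    };
    std::thread t1(work), t2(work);
    t1.join();
    t2.join();
    return counter.load(std::memory_order_relaxed);
}
```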

If your code needs to avoid such reordering, then you have to strengthen your barriers: the first loader.load() needs to be acquire or stronger, and the second one needs to be seq_cst.
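Applied to the question's function, that advice might look like the following sketch (the function name is mine; the orderings are the ones described above):

```cpp
#include <atomic>
#include <cstdint>

// Strengthened version of the question's partial_reorder: the surrounding
// loads now carry enough ordering that neither can slip into the RMW.
void no_partial_reorder(std::atomic<uint64_t>& loader,
                        std::atomic<uint64_t>& adder) {
    // acquire: later accesses (including the RMW's ldaxr) cannot be
    // hoisted above this load, so it cannot appear to sink into the RMW
    loader.load(std::memory_order_acquire);
    adder.fetch_add(1, std::memory_order_seq_cst);
    // seq_cst: this load cannot be reordered before the RMW's stlxr
    loader.load(std::memory_order_seq_cst);
}
```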

Upvotes: 1

Tsyvarev

Reputation: 65928

Looks like you are right. At least, a very similar bug report for GCC was accepted and fixed.

The bug report provides this code:

.L2:
    ldaxr   w1, [x0]       ; load-acquire (__sync_fetch_and_add)
    add w1, w1, 1
    stlxr   w2, w1, [x0]   ; store-release  (__sync_fetch_and_add)
    cbnz    w2, .L2

So previous operations can be reordered with the ldaxr and further operations can be reordered with the stlxr, which breaks C++11 conformance. The documentation for barriers on AArch64 clearly explains that such reordering is possible.
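The eventual GCC fix emits an extra barrier after the RMW. A source-level workaround in the same spirit could be sketched with explicit fences (the helper name is mine, and whether both fences are strictly necessary depends on the compiler version):

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical workaround sketch: bracket the RMW with full fences so that
// neighbouring relaxed accesses cannot migrate into the ldaxr/stlxr window.
// On AArch64 each seq_cst fence typically lowers to a dmb ish instruction.
uint64_t fenced_fetch_add(std::atomic<uint64_t>& adder) {
    std::atomic_thread_fence(std::memory_order_seq_cst);
    uint64_t old = adder.fetch_add(1, std::memory_order_seq_cst);
    std::atomic_thread_fence(std::memory_order_seq_cst);
    return old;  // fetch_add returns the value held before the increment
}
```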

Upvotes: 5
