Reputation: 1857
I'm reading memory consistency model materials and I came across two program examples. I'm not sure whether my understanding is correct, or what the underlying reason is.
The general question is: can the variable passed to the function call use() be observed as 0?
Program 1
int data = 0, ready = 0;

void p1 (void *ignored) {
    data = 2000;
    ready = 1;
}

void p2 (void *ignored) {
    while (!ready)
        ;
    use (data);
}
I think data has to be 2000 when it's used in p2(), because the stores to data and ready are ordered in p1().
Program 2
int a = 0, b = 0;

void p1 (void *ignored) { a = 1; }

void p2 (void *ignored) {
    if (a == 1)
        b = 1;
}

void p3 (void *ignored) {
    if (b == 1)
        use (a);
}
I think a has to be used as 1 in p3(): in p3(), a won't be used unless b == 1; and in p2(), b won't be stored unless a == 1. So a has to be 1 when it is used in p3().
Is my understanding correct?
I'm considering an Intel Haswell processor with 3 levels of cache. Let's consider two situations: NUMA and UMA.
Right, I could write a multi-threaded program to test it, but I would prefer to understand in theory why it works or doesn't, so that I understand the secret behind the behavior. :-D
[Another answer] If we consider read prefetching on Intel processors and the cache coherence protocol, it's possible that one thread prefetches the variable from its private cache before the new value is stored on another core and the stale line is marked invalid by the cache controller. In that case, both programs could use the variable as 0. It would be the same situation under both the UMA and NUMA models.
Thank you very much for your help!
Upvotes: 0
Views: 380
Reputation: 364318
Program 1
If that's literal C, not pseudocode, then: the compiler is allowed to reorder the stores in p1 at compile time, and to hoist the load of ready out of the while (!ready); loop in p2, because ready isn't volatile. (So the loop always runs zero or infinite iterations.) The normal way to fix this would be C11's atomic_load_explicit(&ready, memory_order_acquire). (On x86, every load is a load-acquire, so it's free. Writing it that way instead of just *(volatile int*)&ready would make your code portable to any C11 implementation, regardless of architecture.)

You're making the mistake of thinking that a C implementation targeting a strongly-ordered ISA has strong ordering at the source level. C programs target the C abstract machine: compilers make executable code that produces results as if it were literally running the C source on the abstract machine, with the abstract machine's memory-ordering rules. See the link to Jeff Preshing's blog in the previous paragraph.
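A minimal sketch of what Program 1 looks like with the acquire-load (and a matching release-store) the paragraph above describes. The thread structure and helper names are my own illustration, not code from the question:

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static int data = 0;
static atomic_int ready = 0;

static void *p1(void *ignored) {
    (void)ignored;
    data = 2000;
    /* release-store: the data store can't be reordered after this,
     * at compile time or run time */
    atomic_store_explicit(&ready, 1, memory_order_release);
    return NULL;
}

static void *p2(void *out) {
    /* acquire-load: the compiler can't hoist it out of the loop, and
     * once it reads 1, the data = 2000 store is guaranteed visible */
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;
    *(int *)out = data;
    return NULL;
}

int run_demo(void) {
    pthread_t t1, t2;
    int observed = 0;
    pthread_create(&t2, NULL, p2, &observed);
    pthread_create(&t1, NULL, p1, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return observed;
}
```

On x86 both atomics compile to plain loads and stores; their real job here is constraining the compiler.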
Program 2
As long as the load in p3 is an acquire-load, then yes, your reasoning is sound. (On x86, this happens for free, and with code like that it's unlikely that compile-time speculative reordering could produce code that behaves differently. It's possible, though: value speculation is allowed in general.)
I'm not sure if the b=1 store in p2 needs to be a release-store. I think so; otherwise, on a weakly-ordered system it could become globally visible before the load that found a==1. (Again, this is free on x86.)
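Spelled out with explicit orderings, Program 2 might look like the sketch below (my own illustration, assuming the release-store on b discussed above). The release/acquire chain p1 → p2 → p3 builds a transitive happens-before relationship, which is why p3 must see a == 1:

```c
#include <assert.h>
#include <stdatomic.h>

static atomic_int a = 0, b = 0;

static void p1(void) {
    atomic_store_explicit(&a, 1, memory_order_release);
}

static void p2(void) {
    if (atomic_load_explicit(&a, memory_order_acquire) == 1)
        /* release-store: can't become globally visible before the
         * load above that found a == 1 */
        atomic_store_explicit(&b, 1, memory_order_release);
}

static int p3(void) {
    if (atomic_load_explicit(&b, memory_order_acquire) == 1)
        /* happens-before is transitive, so this must read 1 */
        return atomic_load_explicit(&a, memory_order_relaxed);
    return -1; /* b not set yet */
}
```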
I'm considering an Intel Haswell processor with 3 levels of cache. Let's consider two situations: NUMA and UMA.
NUMA doesn't affect the ISA's ordering guarantees. It may make reordering more likely, or possible in ways that don't happen in practice on existing single-socket CPUs. (Although note that hyperthreading is a sort of opposite case, because threads sharing the same physical core see each other's memory accesses much more quickly than other cores do.)
Code that breaks on a NUMA system is broken, period, and shouldn't be trusted on any system.
If you're writing new code, please use C11 atomics. You need something to prevent compile-time reordering / hoisting, and C11 stdatomic (or the equivalent C++11 std::atomic) is the modern way to do it.
Not only will your code avoid any compiler-specific stuff for compiler-barriers (to prevent reordering), your code will be self-documenting in terms of what memory-ordering requirements it actually depends on. It will even be portable to ARM or any other architecture, because it explicitly uses acquire-loads where needed, and release-stores where needed.
The default ordering for atomic types is memory_order_seq_cst, though, so you will often need the explicit-ordering versions of functions that include a store, to stop them from emitting instructions for a full memory barrier (mfence on x86) when you don't need one. x86 atomic read-modify-writes always have to use a lock prefix, so for them there's no benefit on x86 to orderings weaker than memory_order_seq_cst, but it doesn't hurt to use the weakest ordering that makes your algorithm correct. (Except that then you can't test on x86 hardware to see if you used too weak an ordering.)
e.g. my_var = 1 (for an _Atomic int my_var) will compile to mov [my_var], 1 / mfence, so you have to use atomic_store_explicit(&my_var, 1, memory_order_release) to have it compile to just a normal x86 store.
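Side by side, the two stores from that example (sketch only; the variable name is illustrative):

```c
#include <assert.h>
#include <stdatomic.h>

static atomic_int my_var = 0;

void store_default(void) {
    /* plain assignment to an atomic defaults to memory_order_seq_cst:
     * on x86 this costs a full barrier (mov + mfence, or an xchg) */
    my_var = 1;
}

void store_release(void) {
    /* compiles to just a normal mov store on x86 */
    atomic_store_explicit(&my_var, 1, memory_order_release);
}
```

Both are correct for publishing a flag like ready above; the release version simply avoids the barrier you don't need.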
See a simple/naive semaphore (counting lock) implementation using C11 atomics for example.
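As a flavor of what such a naive counting semaphore can look like (my own sketch, not the linked implementation): spin until the count is positive, then claim a slot with a CAS.

```c
#include <assert.h>
#include <stdatomic.h>

typedef struct { atomic_int count; } naive_sem_t;

void naive_sem_init(naive_sem_t *s, int n) {
    atomic_init(&s->count, n);
}

void naive_sem_wait(naive_sem_t *s) {
    for (;;) {
        int c = atomic_load_explicit(&s->count, memory_order_relaxed);
        if (c > 0 &&
            atomic_compare_exchange_weak_explicit(
                &s->count, &c, c - 1,
                memory_order_acquire,   /* taking the lock: acquire */
                memory_order_relaxed))
            return;
        /* else spin; a real implementation would pause or sleep */
    }
}

void naive_sem_post(naive_sem_t *s) {
    /* releasing: make prior writes visible to the next waiter */
    atomic_fetch_add_explicit(&s->count, 1, memory_order_release);
}
```

Note the acquire on the successful CAS and the release on the post: exactly the acquire-load / release-store pairing discussed above.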
Upvotes: 1