Is a read of 8 bytes on modern Intel x86 guaranteed to be sane if the 8 bytes are written by a different thread?

Question

struct Data {
    double a;
    double b;
    double c;
};

Would the read of each double be sane, if read on a different thread, but only one other thread is writing to each of a,b,c?

What is the scenario if I ensure Data is aligned?

struct Data { double a, b, c; } __attribute__((aligned(64)));

This would ensure that a, b, and c sit at offsets 0, 8, and 16 from a 64-byte-aligned base, so each is always aligned to an 8-byte (8*8 = 64-bit) boundary.

The question "alignment requirements for atomic x86 instructions" and its answer make me think that it's perfectly valid to write to Data::a/b/c from another thread and simultaneously read them without using std::atomic.

Yes I know std::atomic would solve this, but that's not the question.

Peter Cordes · Accepted Answer

Yes, aligned 8-byte loads/stores are guaranteed atomic by the x86 ISA, since P5 Pentium. See: Why is integer assignment on a naturally aligned variable atomic on x86?

But this is C++; there's no guarantee that stores and reloads aren't optimized away. Writing a non-atomic object in one thread while reading it in another is a data race, which is C++ Undefined Behaviour; compilers are allowed to assume it doesn't happen, breaking naive assumptions. This lets them keep C++ objects (including global variables or memory pointed to by some pointer) in registers across multiple reads/writes, only eventually storing the final value.

Since you didn't already know that volatile or atomic are needed for this reason, better read about the other things that atomic<> does for you, like ordering wrt. other operations unless you use memory_order_relaxed (the default is seq_cst which makes stores expensive, but on x86 loads are still just as cheap). And (like volatile) the assumption that other threads might have modified an object between accesses in this thread. See Can num++ be atomic for 'int num'?, some of which is relevant for FP loads and stores.

Lockless programming in C++ is not simple, unless you have zero need for synchronization / ordering. Then you "just" have to make sure you tell the compiler what you mean, with atomic, or as a hack with volatile.
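As a rough sketch of the "tell the compiler what you mean" option, here is the question's struct with std::atomic<double> members and relaxed accesses. The names SharedData, publish_a, read_a and run_writer_reader are made up for illustration, and this assumes you need no ordering between a, b and c.

```cpp
#include <atomic>
#include <thread>

// Hypothetical variant of the question's struct: each member is its own atomic.
struct SharedData {
    std::atomic<double> a{0.0};
    std::atomic<double> b{0.0};
    std::atomic<double> c{0.0};
};

// Writer side: relaxed store, no ordering with respect to other variables.
inline void publish_a(SharedData& d, double v) {
    d.a.store(v, std::memory_order_relaxed);
}

// Reader side: relaxed load; the access is a real load each time it is called,
// and never a data race in the ISO C++ sense.
inline double read_a(const SharedData& d) {
    return d.a.load(std::memory_order_relaxed);
}

// One writer, one reader. Returns the last value the reader observed,
// which is n once the writer's final store has become visible.
inline double run_writer_reader(int n) {
    SharedData d;
    std::thread writer([&d, n] {
        for (int i = 1; i <= n; ++i) publish_a(d, static_cast<double>(i));
    });
    double seen = 0.0;
    while (seen < n) seen = read_a(d);  // spins until the final value shows up
    writer.join();
    return seen;
}
```

The reader may skip intermediate values; relaxed only promises that each value it does see is one that was actually stored, never a torn mix.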

Since GCC's std::atomic<double> with mo_relaxed doesn't compile efficiently, you might want to roll your own by making members volatile if you don't care about portability (or even casting to (volatile double*) like the Linux kernel's READ_ONCE / WRITE_ONCE macros). With clang you can just use atomic<double> with memory_order_relaxed and things will compile efficiently. See C++20 std::atomic<float> - std::atomic<double> specializations for an example of what you can do before C++20; C++20 only adds atomic RMW add/sub for double so you don't have to roll your own with a CAS loop.
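For reference, a hand-rolled CAS loop of the kind you'd need before C++20 could look roughly like this. fetch_add_relaxed is a hypothetical helper name, not a standard function, and relaxed ordering is an assumption you'd adjust to your needs.

```cpp
#include <atomic>

// Hypothetical pre-C++20 stand-in for std::atomic<double>::fetch_add.
// Returns the value held before the addition, like fetch_add does.
inline double fetch_add_relaxed(std::atomic<double>& target, double delta) {
    double expected = target.load(std::memory_order_relaxed);
    // On failure, compare_exchange_weak writes the current value into
    // `expected`, so each retry recomputes the sum from fresh data.
    while (!target.compare_exchange_weak(expected, expected + delta,
                                         std::memory_order_relaxed,
                                         std::memory_order_relaxed)) {
        // empty body: the retry logic lives in the loop condition
    }
    return expected;
}
```

Note that compare_exchange compares object representations (bits), not floating-point values, which is what makes this loop terminate even for NaN or -0.0.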

volatile will probably still defeat auto-vectorization, but you can of course use _mm_load_pd or whatever. (See also Atomic double floating point or SSE/AVX vector load/store on x86_64 - note that SIMD load/store aren't necessarily atomic even if aligned. Also undocumented is whether they're per-element atomic, although that is I think safe to assume. Per-element atomicity of vector load/store and gather/scatter?)

See also When to use volatile with multi threading? The answer is normally never, except maybe as a workaround for GCC which won't emit efficient asm for atomic, and where we know exactly how volatile compiles to asm.

BTW, you only need alignas(8) to make sure members are 8-byte aligned. Aligning the struct to a whole cache line doesn't hurt, unless it wastes space.
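A small sketch of what that looks like, with compile-time checks so a surprising ABI gets caught early. Data8 and is_8_byte_aligned are illustrative names, and the size check assumes an 8-byte double.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical version of the question's struct with per-member alignas(8).
// On x86-64 double is already 8-byte aligned; on 32-bit x86 System V it
// is only guaranteed 4-byte alignment inside structs, so alignas(8) matters there.
struct Data8 {
    alignas(8) double a;
    alignas(8) double b;
    alignas(8) double c;
};

static_assert(alignof(Data8) == 8, "struct inherits the members' alignment");
static_assert(sizeof(Data8) == 24, "three 8-byte members, no padding");
static_assert(offsetof(Data8, b) % 8 == 0, "b is naturally aligned");
static_assert(offsetof(Data8, c) % 8 == 0, "c is naturally aligned");

// Runtime check for an arbitrary address, e.g. one obtained from a pointer cast.
inline bool is_8_byte_aligned(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 8 == 0;
}
```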

For performance: if different threads are using different variables in the same cache line, that's "false sharing" and terrible for performance. Don't group your shared variables together in one struct unless they're usually read or written as a group. Otherwise you definitely want them in separate 64-byte cache lines.
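If a, b and c really are written by different threads, one way to sketch the "separate cache lines" layout is to align each member to the line size. PaddedData and kCacheLine are illustrative names, and 64 bytes is an assumption that matches current x86 CPUs.

```cpp
#include <atomic>
#include <cstddef>

// Assumption: 64-byte cache lines, as on current x86 CPUs.
constexpr std::size_t kCacheLine = 64;

// Hypothetical layout: each member starts its own cache line, so a writer
// of `a` does not invalidate the line that readers of `b` or `c` are using.
struct PaddedData {
    alignas(kCacheLine) std::atomic<double> a{0.0};
    alignas(kCacheLine) std::atomic<double> b{0.0};
    alignas(kCacheLine) std::atomic<double> c{0.0};
};

static_assert(alignof(PaddedData) == kCacheLine, "struct is line-aligned");
static_assert(sizeof(PaddedData) == 3 * kCacheLine, "one line per member");
```

This costs 192 bytes instead of 24, which is the "wastes space" trade-off; it's only worth it when the members are genuinely contended by different threads.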

Note that a data-race on a volatile is still ISO C++ undefined-behaviour, but if you're using GNU C (as required by your __attribute__), it's pretty much well defined. The Linux kernel uses it for its own hand-rolled atomic (along with inline asm for barriers) so you can assume it's not going to be intentionally unsupported any time soon.
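A minimal sketch of that volatile hack, loosely modelled on the idea behind the kernel's READ_ONCE / WRITE_ONCE (these are not the kernel's actual macros). read_once and write_once are hypothetical names, and the whole thing assumes GNU C++ on x86 with naturally aligned objects of at most 8 bytes.

```cpp
// Hypothetical helpers: force exactly one real load or store via a
// volatile access. Not ISO C++ data-race-free; relies on GNU C behaviour.
template <typename T>
inline T read_once(const T& obj) {
    static_assert(sizeof(T) <= 8, "only naturally-atomic sizes make sense here");
    return *static_cast<const volatile T*>(&obj);
}

template <typename T>
inline void write_once(T& obj, T value) {
    static_assert(sizeof(T) <= 8, "only naturally-atomic sizes make sense here");
    *static_cast<volatile T*>(&obj) = value;
}
```

With a plain `double shared;` you would call `write_once(shared, 1.0)` in the writer and `read_once(shared)` in the reader. This gives no ordering with respect to other memory operations, similar in spirit to mo_relaxed.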

TL;DR: in GNU C it does more or less work to think of volatile as atomic with mo_relaxed, for aligned objects small enough to be naturally atomic.
