Reputation: 21
Hi everyone. I wrote a demo to reproduce the problem referenced in the cppreference demo for memory_order_relaxed. I found that some documents and blogs say this may not reproduce on x86 chips, but can on ARM chips, because the ARM architecture is weakly ordered. Therefore, I specifically ran the experiment on an Apple M1 chip (ARM). However, it cannot reproduce r1 == r2 == 42 either.
Please point out any problems with my demo, or share any other demos that might reproduce the memory ordering problem.
#include <atomic>
#include <chrono>
#include <cstdlib>    // std::abort (used by CHECK)
#include <functional> // std::function (used by TaskQueue)
#include <iostream>
#include <mutex>
#include <ostream>
#include <queue>
#include <thread>

#define CHECK(condition)                                        \
  if (!(condition)) {                                           \
    std::cerr << "Check failed: " << #condition << std::endl;   \
    std::abort();                                               \
  }
class TaskQueue {
 public:
  TaskQueue() = default;
  ~TaskQueue() = default;

  void AddTask(const std::function<void()> &task) {
    std::lock_guard<std::mutex> lock(mtx_);
    tasks_.push(task);
  }

  void AddTaskWithCallBack(std::function<void()> task,
                           std::function<void()> callback) {
    std::lock_guard<std::mutex> lock(mtx_);
    tasks_.push([task, callback]() {
      task();
      callback();
    });
  }

  bool Empty() const {
    std::lock_guard<std::mutex> lock(mtx_);
    return tasks_.empty();
  }

  bool GetTask(std::function<void()> &gotten_task) {
    std::lock_guard<std::mutex> lock(mtx_);
    if (tasks_.empty()) {
      return false;
    }
    gotten_task = tasks_.front();
    tasks_.pop();
    return true;
  }

 private:
  std::queue<std::function<void()>> tasks_;
  mutable std::mutex mtx_;
};
TaskQueue thread1_queue;
TaskQueue thread2_queue;

void ThreadWork(std::chrono::milliseconds delay, TaskQueue &queue) {
  std::cout << "Launched thread : " << std::this_thread::get_id() << std::endl;
  std::chrono::steady_clock::time_point start =
      std::chrono::steady_clock::now();
  auto end = start + delay;
  std::function<void()> task;
  while (std::chrono::steady_clock::now() < end) {
    if (queue.GetTask(task)) {
      CHECK(task);
      task();
    } else {
      std::this_thread::yield();
    }
  }
}
std::atomic<int> x(-1);
std::atomic<int> y(-2);
int r1 = -3;
int r2 = -4;
std::mutex mtx;
int main() {
  std::cout << "hello world" << std::endl;
  std::thread t2(ThreadWork, std::chrono::milliseconds(20000),
                 std::ref(thread2_queue));
  std::thread t1(ThreadWork, std::chrono::milliseconds(20000),
                 std::ref(thread1_queue));
  for (int i = 0; i < 100; ++i) {
    x.store(-1);
    y.store(-2);
    r1 = -3;
    r2 = -4;
    thread1_queue.AddTask([]() {
      r1 = y.load(std::memory_order_relaxed); // A
      x.store(r1, std::memory_order_relaxed); // B
    });
    thread2_queue.AddTask([]() {
      r2 = x.load(std::memory_order_relaxed); // C
      y.store(42, std::memory_order_relaxed); // D
    });
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
    std::cout << "r1: " << r1 << ", r2: " << r2 << std::endl;
  }
  t1.join();
  t2.join();
}
To avoid the influence of thread scheduling, I create two threads and keep them running the whole time, then post the experiment code to their TaskQueues.
Upvotes: 2
Views: 87
Reputation: 39658
The point of the cppreference example is to demonstrate an unlikely and paradoxical scenario. It's not unusual that you wouldn't observe that behavior in practice. The example is mostly academic and meant to demonstrate what could happen with relaxed memory order in theory:
// Thread 1:
r1 = y.load(std::memory_order_relaxed); // A
x.store(r1, std::memory_order_relaxed); // B

// Thread 2:
r2 = x.load(std::memory_order_relaxed); // C
y.store(42, std::memory_order_relaxed); // D
Due to all loads and stores being relaxed, it would be permitted for D to effectively happen before C, and to happen before A if Thread 2 runs first.
In theory, that would give you r1 == r2 == 42.
In particular, this may occur if D is completed before C in thread 2, either due to compiler reordering or at runtime.
However, compilers have no particular incentive to reorder C and D. The ARMv8 clang output for thread 2's task is as follows:
ldr w8, [x8, :lo12:x]   // C: load x
mov w11, #42
str w8, [x9, :lo12:r2]  // r2 = result of C
str w11, [x10, :lo12:y] // D: store 42 to y
It's still possible that D happens before C due to hardware, given that there are no memory barriers present here. However, thread 1 and thread 2 do very little work, so it's likely that they simply run those few instructions in order, long before the other thread runs.
Therefore, it would be a huge coincidence if you did observe r1 == r2 == 42, though it's not impossible.
Note: Here is a simpler setup based on busy waiting which also fails to reproduce the issue: https://godbolt.org/z/61Y86o1bn.
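In case the link goes stale, here is a minimal sketch in the same spirit (not necessarily identical to the linked code): two threads spin on a go flag so that both halves of the experiment start as close together as possible.

#include <atomic>
#include <iostream>
#include <thread>

std::atomic<int> x(0), y(0);
std::atomic<bool> go(false);
int r1 = 0, r2 = 0;

int main() {
  std::thread t1([] {
    while (!go.load(std::memory_order_relaxed)) {} // spin until the start signal
    r1 = y.load(std::memory_order_relaxed);        // A
    x.store(r1, std::memory_order_relaxed);        // B
  });
  std::thread t2([] {
    while (!go.load(std::memory_order_relaxed)) {} // spin until the start signal
    r2 = x.load(std::memory_order_relaxed);        // C
    y.store(42, std::memory_order_relaxed);        // D
  });
  go.store(true, std::memory_order_relaxed); // release both spinners at once
  t1.join();
  t2.join();
  std::cout << "r1: " << r1 << ", r2: " << r2 << '\n'; // 42/42 would show the reordering
}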
Upvotes: 2
Reputation: 364448
Put x and y in separate cache lines

You have x and y in the same cache line, so it's pretty much impossible for y.store(42, relaxed) (D) to commit and become visible to the other thread while the x.load (C) is still in flight. This core would need MESI exclusive ownership of the cache line to commit the store from the store buffer to L1d cache (making it globally visible, which is the only way for stores to become visible to other cores in ARMv8), which is more than sufficient for the load to complete.
Since they're both global, the load address is known plenty early, not e.g. the result of another load of a pointer. Just delaying the load via the address being available late wouldn't be sufficient, though: the address has to be known for the load to retire from the ROB (Reorder Buffer), because the CPU has to know it won't fault. So the load address has to be ready before the load can retire, and that has to happen before the store can retire and later commit, unless we get compile-time reordering of (C) and (D).
Use alignas(64) std::atomic<int> x(-1), y(-2);. And preferably also align r1 and r2 and the task queues so they can't be in the same cache line as x or y, so you can try to "prime" the cache state independently of saving the results. (Or use a struct with padding to force the layout of x and y, preventing anything else from being in the same cache line without having to align all your other globals, like struct xy { alignas(64) std::atomic<int> x, y; };, which will have size 128 including the padding after y.)
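A sketch of that layout change (note: Apple M1 has 128-byte cache lines, so alignas(128) would be the safer choice there; 64 is typical elsewhere):

#include <atomic>

alignas(64) std::atomic<int> x(-1);
alignas(64) std::atomic<int> y(-2);
alignas(64) int r1 = -3; // keep the results out of x's and y's lines too
alignas(64) int r2 = -4;

// Or the struct variant: x at offset 0, y at offset 64, sizeof(xy) == 128.
struct xy {
  alignas(64) std::atomic<int> x, y;
};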
If both tasks ran at the same time (or with task 2 starting slightly earlier), yes, you could see the r1 == r2 == 42 result you're looking for with x and y in separate cache lines.
That timing is not super likely in your setup unless an interrupt or something delays the first thread. When thread1_queue.AddTask finishes, it unlocks the mutex in that queue. Your spin-wait loop makes a slow system call, std::this_thread::yield(), so it might not already be waiting for the mutex when AddTask unlocks it. In that case there will be some time before it notices the mutex is available.
So that leaves it up to chance how close the timing is between the two threads, and you only do one attempt per 100 ms so it's very slow.
Having both threads C++20 .wait() on an atomic go flag, and using .notify_all() after adding both tasks, would be likely to get both woken up very close to the same time. Or some other method of implementing a simultaneous wakeup, like perhaps a condition variable. A sketch is shown below.
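A sketch of that wakeup scheme (requires C++20 atomic wait/notify; the trial_gen counter name is made up for this example):

#include <atomic>

std::atomic<int> trial_gen(0); // bumped by main once per trial

// Worker side: block until main publishes a new trial number.
int AwaitNextTrial(int last_seen) {
  int g = trial_gen.load(std::memory_order_acquire);
  while (g == last_seen) {
    trial_gen.wait(last_seen); // sleeps while trial_gen == last_seen
    g = trial_gen.load(std::memory_order_acquire);
  }
  return g;
}

// Main side, per trial: reset x, y, r1, r2, enqueue both tasks, then
//   trial_gen.fetch_add(1, std::memory_order_release);
//   trial_gen.notify_all(); // both workers wake very close to the same time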
The LoadStore reordering effect you're looking for would also be more likely if thread 2 already had y's cache line in MESI Exclusive state, ready to commit the store, but didn't have access to x's cache line.
After a trial, before the .wait for the next trial, you could assign to y in thread 2, or to a volatile int in the same cache line. And have thread 1 assign to x or something else in its cache line.
And/or have a third thread (or a few more threads) hammering away at the cache line containing x, causing extra contention for access to it. But that would also delay thread 1 trying to commit its store to x, which needs to happen before thread 2's load of x eventually completes.
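A sketch of the priming and contention ideas combined (names like PrimeThread1 and the running stop flag are made up for illustration):

#include <atomic>
#include <thread>

alignas(64) std::atomic<int> x(-1);
alignas(64) std::atomic<int> y(-2);
std::atomic<bool> running(true); // stop flag for the hammering thread

// Thread 2, between trials: dirty y's line so it is already in MESI
// Exclusive/Modified state and the next y.store can commit quickly.
void PrimeThread2() { y.store(-2, std::memory_order_relaxed); }

// Thread 1, between trials: touch x's line so thread 2 loses ownership of it
// and its next x.load (C) has to wait for an off-core miss.
void PrimeThread1() { x.store(-1, std::memory_order_relaxed); }

// Optional third thread hammering x's cache line to add contention.
void HammerX() {
  while (running.load(std::memory_order_relaxed)) {
    x.store(-1, std::memory_order_relaxed);
  }
}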
Upvotes: 5