Anton

Reputation: 21

C++ atomic variable memory order problem can not reproduce LoadStore reordering example

Hi everyone. I wrote a demo to reproduce the problem referenced on cppreference.

cppreference demo. I found that some documents and blog posts say this may not reproduce on x86 chips, but may on ARM chips, because the ARM architecture has a weak memory model. Therefore, I specifically ran the experiment on an Apple M1 chip (ARM). However, it cannot reproduce r1 == r2 == 42 either.

Please let me know if there are any problems with my demo, or if you have any other demos that might reproduce the memory-ordering problem.

#include <atomic>
#include <chrono>
#include <functional> // needed for std::function
#include <iostream>
#include <mutex>
#include <ostream>
#include <queue>
#include <thread>
#define CHECK(condition)                                                       \
  if (!(condition)) {                                                          \
    std::cerr << "Check failed: " << #condition << std::endl;                  \
    std::abort();                                                              \
  }

class TaskQueue {
public:
  TaskQueue() = default;
  ~TaskQueue() = default;
  void AddTask(const std::function<void()> &task) {
    std::lock_guard<std::mutex> lock(mtx_);
    tasks_.push(task);
  }
  void AddTaskWithCallBack(std::function<void()> task,
                           std::function<void()> callback) {
    std::lock_guard<std::mutex> lock(mtx_);
    tasks_.push([task, callback]() {
      task();
      callback();
    });
  }

  bool Empty() const {
    std::lock_guard<std::mutex> lock(mtx_);
    return tasks_.empty();
  }

  bool GetTask(std::function<void()> &gotten_task) {
    std::lock_guard<std::mutex> lock(mtx_);
    if (tasks_.empty()) {
      return false;
    }
    gotten_task = tasks_.front();
    tasks_.pop();
    return true;
  }

private:
  std::queue<std::function<void()>> tasks_;
  mutable std::mutex mtx_;
};

TaskQueue thread1_queue;
TaskQueue thread2_queue;

void ThreadWork(std::chrono::milliseconds delay, TaskQueue &queue) {
  std::cout << "Launched thread : " << std::this_thread::get_id() << std::endl;
  std::chrono::steady_clock::time_point start =
      std::chrono::steady_clock::now();
  auto end = start + delay;
  std::function<void()> task;
  while (std::chrono::steady_clock::now() < end) {
    if (queue.GetTask(task)) {
      CHECK(task);
      task();
    } else {
      std::this_thread::yield();
    }
  }
}

std::atomic<int> x(-1);
std::atomic<int> y(-2);
int r1 = -3;
int r2 = -4;
std::mutex mtx;

int main() {
  std::cout << "hello world" << std::endl;
  std::thread t2(ThreadWork, std::chrono::milliseconds(20000),
                 std::ref(thread2_queue));
  std::thread t1(ThreadWork, std::chrono::milliseconds(20000),
                 std::ref(thread1_queue));
  for (int i = 0; i < 100; ++i) {
    x.store(-1);
    y.store(-2);
    r1 = -3;
    r2 = -4;
    thread1_queue.AddTask([]() {
      r1 = y.load(std::memory_order_relaxed); // A
      x.store(r1, std::memory_order_relaxed); // B
    });
    thread2_queue.AddTask([]() {
      r2 = x.load(std::memory_order_relaxed); // C
      y.store(42, std::memory_order_relaxed); // D
    });
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
    std::cout << "r1: " << r1 << ", r2: " << r2 << std::endl;
  }
  t1.join();
  t2.join();
}

To avoid the influence of thread scheduling, I create two threads and keep them running the whole time, then post the experiment code to their TaskQueues.

Upvotes: 2

Views: 87

Answers (2)

Jan Schultke

Reputation: 39658

The point of the cppreference example is to demonstrate an unlikely and paradoxical scenario. It's not unusual that you wouldn't observe that behavior in practice. The example is mostly academic and meant to demonstrate what could happen with relaxed memory order in theory:

// Thread 1:
r1 = y.load(std::memory_order_relaxed); // A
x.store(r1, std::memory_order_relaxed); // B
// Thread 2:
r2 = x.load(std::memory_order_relaxed); // C 
y.store(42, std::memory_order_relaxed); // D

Due to all loads and stores being relaxed, it would be permitted for D to effectively happen before C, and to happen before A if Thread 2 runs first. In theory, that would give you r1 == r2 == 42.

In particular, this may occur if D is completed before C in thread 2, either due to compiler reordering or at runtime.

However, compilers have no particular incentive to reorder C and D. The armv8-clang output is as follows:

ldr     w8, [x8, :lo12:x]
mov     w11, #42
str     w8, [x9, :lo12:r2]
str     w11, [x10, :lo12:y]

It's still possible that D happens before C due to hardware, given that there are no memory barriers present here. However, thread 1 and thread 2 do very little work, so it's likely that they simply run those few instructions in order, long before the other thread runs.

Therefore, it would be a huge coincidence if you did observe r1 == r2 == 42, though it's not impossible.

Note: Here is a simpler setup based on busy waiting which also fails to reproduce the issue: https://godbolt.org/z/61Y86o1bn.

Upvotes: 2

Peter Cordes

Reputation: 364448

Put x and y in separate cache lines

You have x and y in the same cache line, so it's pretty much impossible for y.store(42, relaxed) (D) to commit and become visible to the other thread while the x.load (C) is still in flight. This core would need MESI exclusive ownership of the cache line to commit the store from the store buffer to L1d cache (making it globally visible, which is the only way for stores to become visible to other cores in ARMv8), which is more than sufficient for the load to complete.

Since they're both global, the load address is known plenty early, not e.g. the result of another load of a pointer. Just delaying the load via the address being available late wouldn't be sufficient, though: the address has to be known for the load to retire from the ROB (Reorder Buffer), because the CPU has to know it won't fault. So the load address has to be ready before the load can retire, and that has to happen before the store can retire and later commit, unless we get compile-time reordering of (C) and (D).

Use alignas(64) std::atomic<int> x(-1), y(-2);. And preferably also align r1 and r2 and the task queues so they can't be in the same cache line as x or y, so you can try to "prime" the cache state independently of saving the results. (Or use a struct with padding to force the layout of x and y, preventing anything else from being in the same cache line without having to align all your other globals, like struct xy {alignas(64) std::atomic<int> x,y;};, which will have size 128 including padding after y.)


Have both threads wake up from the same trigger

If both tasks ran at the same time (or with task2 starting slightly earlier), yes you could see the r1 = r2 = 42 result you're looking for with x and y in separate cache lines.

That timing is not super likely in your setup unless an interrupt or something delays the first thread. When thread1_queue.AddTask finishes, it unlocks the mutex in that queue. Your spin-wait loop makes a slow system call, std::this_thread::yield(), so the worker might not already be waiting for the mutex when AddTask unlocks it. In that case there will be some delay before it notices the mutex is available.

So that leaves it up to chance how close the timing is between the two threads, and you only do one attempt per 100 ms so it's very slow.

Having both threads C++20 .wait() on an atomic go flag and using .notify_all() after adding both tasks would be likely to get both woken up very close to the same time. Or some other method of implementing a simultaneous wakeup, like perhaps a condition variable.


Prime the caches to a MESI state that makes this more likely.

The LoadStore reordering effect you're looking for would also be more likely if thread 2 already had y's cache line in MESI Exclusive state, ready to commit the store, but didn't have access to x's cache line.

After a trial, before the .wait() for the next trial, you could assign to y in thread 2, or to a volatile int in the same cache line, and have thread 1 assign to x or something else in its cache line.

And/or have a third thread (or a few more threads) hammering away at the cache line containing x, causing extra contention for access to it. But that would also delay thread 1 trying to commit its store to x, which needs to happen before thread 2's load of x eventually completes.

Upvotes: 5
