Do you need to pair memory_order_acquire and memory_order_release around a block of code?

Question

I'm making a profiler for an interpreter, I need the interpreter to write the current frame position somewhere on every call. Then sample that information every X ms. I initially started with rigtorp's spinlock around the frame position, but that had quite an effect on the runtime of the interpreter (profiling pointed at the locking acquire time for every loop through the interpreter). So after reading quite some pages about memory fences I came up with a more efficient solution, but I would like to know if this is the correct interpretation of the relation between memory_order_relaxed & acquire/release.

#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 

using namespace std;
 
typedef struct frame {
    uint8_t op;
    uint16_t arg;
    uint32_t check;
} frame;


static constexpr unsigned int FRAME_BUFFER_SIZE = 8 * 1024;

static atomic index;
static frame frames[FRAME_BUFFER_SIZE];

static void writer() {
    uint8_t op = 0;
    uint16_t arg = 0;
    for (;;) {
        op++;
        arg++;
        const auto newIndex = index.load(memory_order_relaxed) + 1;
        auto &target = frames[newIndex % FRAME_BUFFER_SIZE];
        target.op = op;
        target.arg = arg;
        target.check = static_cast(arg) + op;
        index.store(newIndex, memory_order_release);
        _mm_pause(); // give the other threads some time
    }
}

static void reader() {
    for (;;) {
        const auto lastValidIndex = index.load(memory_order_acquire);
        // we race, hoping that the FRAME_BUFFER_SIZE is enough room 
        // to avoid writter catching up to us
        const auto snapshot = frames[lastValidIndex % FRAME_BUFFER_SIZE];
        if ((static_cast(snapshot.arg) + snapshot.op) != snapshot.check) {
            cout << "Invalid snapshot
";
            exit(1);
        }
        // we sleep a bit, since the reader is only intendede to read once in a while
        this_thread::sleep_for(chrono::milliseconds(1)); 

    }
}

int main() {
    cout << "Starting race
";
    index = 0;
    memset(frames, 0, sizeof(frames));
    thread w(writer);
    thread r(reader);
    w.join();
    r.join();
    return 0;
}

So the strategy is as follows, we have a circular buffer, where the writer is writing into, it's the only one that mutates the index variable, so the first read is memory_order_relaxed. Then we update the value in the array, and then we store the new "full" index, this time with a memory_order_release. The reader only reads the index (this case with a memory_order_acquire) and then indexes of that location in the array.

So my questions are:

does this pairing of fences guarantee that the writes to frames array happen before the update of the index?
is the cpu "cache" of the fences cleared in the read thread every time we do a memory_order_acquire?
is it indeed safe to do a memory_order_relaxed read of the index variable, since we know that our thread is the only one reading this, and we don't care about the values in the frames array?

Do you need to pair memory_order_acquire and memory_order_release around a block of code?

Answers (0)

Related Questions