Reputation: 15438
I'm making a profiler for an interpreter, I need the interpreter to write the current frame position somewhere on every call. Then sample that information every X ms. I initially started with rigtorp's spinlock around the frame position, but that had quite an effect on the runtime of the interpreter (profiling pointed at the locking acquire time for every loop through the interpreter). So after reading quite some pages about memory fences I came up with a more efficient solution, but I would like to know if this is the correct interpretation of the relation between memory_order_relaxed & acquire/release.
#include <memory>
#include <chrono>
#include <string>
#include <iostream>
#include <thread>
#include <immintrin.h>
#include <cstring>
#include <atomic>
using namespace std;
typedef struct frame {
uint8_t op;
uint16_t arg;
uint32_t check;
} frame;
static constexpr unsigned int FRAME_BUFFER_SIZE = 8 * 1024;
static atomic<unsigned int> index;
static frame frames[FRAME_BUFFER_SIZE];
static void writer() {
uint8_t op = 0;
uint16_t arg = 0;
for (;;) {
op++;
arg++;
const auto newIndex = index.load(memory_order_relaxed) + 1;
auto &target = frames[newIndex % FRAME_BUFFER_SIZE];
target.op = op;
target.arg = arg;
target.check = static_cast<uint32_t>(arg) + op;
index.store(newIndex, memory_order_release);
_mm_pause(); // give the other threads some time
}
}
static void reader() {
for (;;) {
const auto lastValidIndex = index.load(memory_order_acquire);
// we race, hoping that the FRAME_BUFFER_SIZE is enough room
// to avoid writter catching up to us
const auto snapshot = frames[lastValidIndex % FRAME_BUFFER_SIZE];
if ((static_cast<uint32_t>(snapshot.arg) + snapshot.op) != snapshot.check) {
cout << "Invalid snapshot\n";
exit(1);
}
// we sleep a bit, since the reader is only intendede to read once in a while
this_thread::sleep_for(chrono::milliseconds(1));
}
}
int main() {
cout << "Starting race\n";
index = 0;
memset(frames, 0, sizeof(frames));
thread w(writer);
thread r(reader);
w.join();
r.join();
return 0;
}
So the strategy is as follows, we have a circular buffer, where the writer is writing into, it's the only one that mutates the index
variable, so the first read is memory_order_relaxed
. Then we update the value in the array, and then we store the new "full" index, this time with a memory_order_release
. The reader only reads the index (this case with a memory_order_acquire
) and then indexes of that location in the array.
So my questions are:
fences
guarantee that the writes to frames
array happen before the update of the index?fences
cleared in the read thread every time we do a memory_order_acquire
?memory_order_relaxed
read of the index
variable, since we know that our thread is the only one reading this, and we don't care about the values in the frames
array?Upvotes: 0
Views: 104