Reputation: 2512
I am reading the implementation of rigtorp's SPSCQueue, which is a very elegant design and has very good benchmarks. I understand most of the design philosophy as described in the README. What I could not understand is how the writeIdxCache_ and readIdxCache_ members help.
Looking at the following function for fetching an item from the queue (case 0):
RIGTORP_NODISCARD T *front() noexcept {
  auto const readIdx = readIdx_.load(std::memory_order_relaxed);
  if (readIdx == writeIdxCache_) {
    writeIdxCache_ = writeIdx_.load(std::memory_order_acquire);
    if (writeIdxCache_ == readIdx) {
      return nullptr;
    }
  }
  return &slots_[readIdx + kPadding];
}
If I compare the above implementation with the following ones, what is the benefit?
Case 1: What is the benefit of using the writeIdxCache_ in the first place?
RIGTORP_NODISCARD T *front() noexcept {
  auto const readIdx = readIdx_.load(std::memory_order_relaxed);
  if (readIdx == writeIdx_.load(std::memory_order_acquire)) {
    return nullptr;
  }
  return &slots_[readIdx + kPadding];
}
Case 2: directly set readIdxCache_
Here I directly set the readIdxCache_ as well when loading the read index. Does this approach still work, or would it break the queue? Is it better or worse than case 0?
RIGTORP_NODISCARD T *front() noexcept {
  readIdxCache_ = readIdx_.load(std::memory_order_relaxed);
  if (readIdxCache_ == writeIdxCache_) {
    writeIdxCache_ = writeIdx_.load(std::memory_order_acquire);
    if (writeIdxCache_ == readIdxCache_) {
      return nullptr;
    }
  }
  return &slots_[readIdxCache_ + kPadding];
}
Upvotes: 1
Views: 178
Reputation: 1
Re. case 2 (directly set readIdxCache_): this looks broken to me. readIdxCache_ is purely local to the writer thread; it can't/shouldn't be shared with the reader.
readIdxCache_ is an optimization (as Peter pointed out) to reduce the frequency of the writer's accesses to the reader's cursor (its cache line, specifically). I.e. once readIdxCache_ is set, the writer can make fast progress until its own cursor again reaches readIdxCache_, before it needs to hit the current reader cursor (cache line) again to refresh readIdxCache_.
In addition to bringing the reader's cursor into the writer's L1 cache, that refresh makes the reader's next store to its own cursor much more expensive, since the reader now has to issue an RFO to bring the shared cache line back to Exclusive state (MESI).
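To make the writer-side role concrete, here is a sketch of a try_push path written in the same style as the question's snippets. It is not quoted from the library; member names such as capacity_, slots_ and kPadding are assumed to match the question's fragments.

RIGTORP_NODISCARD bool try_push(const T &v) noexcept {
  auto const writeIdx = writeIdx_.load(std::memory_order_relaxed);
  auto nextWriteIdx = writeIdx + 1;
  if (nextWriteIdx == capacity_) {
    nextWriteIdx = 0;
  }
  if (nextWriteIdx == readIdxCache_) {
    // Only when the cached reader position says "full" do we actually load
    // readIdx_ and touch the reader's cache line.
    readIdxCache_ = readIdx_.load(std::memory_order_acquire);
    if (nextWriteIdx == readIdxCache_) {
      return false; // really full
    }
  }
  slots_[writeIdx + kPadding] = v;
  writeIdx_.store(nextWriteIdx, std::memory_order_release);
  return true;
}

So readIdxCache_ only ever lives in the producer thread; having the consumer assign to it, as in case 2, is an unsynchronized write to a non-atomic member that the producer also reads and writes.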
Upvotes: 0
Reputation: 365517
For a CPU to commit a store to a cache line, it needs exclusive ownership of that cache line. (MESI Modified or Exclusive state). If one thread is writing a shared variable and no other threads are reading it, things are efficient and the cache line containing it can stay in that state.
But if another thread so much as reads it, its MESI Share Request will transition the cache line from Modified to Shared, so the writer will need a Read For Ownership before it can commit its next store. Also, the thread doing the reading will have to wait for that cache miss (share request) on its loads, if the writer has written since its last read. (A Read For Ownership invalidates all other copies of the cache line in other cores.)
Without the index caches, every call to front() reads both shared variables, including the one the other thread sometimes writes. This creates contention on writeIdx_, as cache-coherency messages for it have to get sent for most reads and most writes. (And similarly for the writer creating contention on readIdx_ by reading it every time.)
With the index caches, front() only touches writeIdx_ occasionally, only after all the queue entries it saw last time have been read. And the writer similarly avoids touching readIdx_ often, so if you're removing entries from the queue, writes to readIdx_ can be efficient.
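As an illustration only (not rigtorp's actual code; the class, member and function names below are made up for the sketch), a minimal self-contained SPSC ring buffer using both index caches looks roughly like this:

#include <atomic>
#include <cstddef>
#include <optional>
#include <vector>

template <typename T> class SpscQueueSketch {
  static constexpr std::size_t kCacheLineSize = 64;
  std::size_t capacity_;      // number of slots; one slot stays empty to tell full from empty
  std::vector<T> slots_;

  // Written by the producer, read (occasionally) by the consumer.
  alignas(kCacheLineSize) std::atomic<std::size_t> writeIdx_{0};
  // Producer-private cache of readIdx_.
  alignas(kCacheLineSize) std::size_t readIdxCache_ = 0;
  // Written by the consumer, read (occasionally) by the producer.
  alignas(kCacheLineSize) std::atomic<std::size_t> readIdx_{0};
  // Consumer-private cache of writeIdx_.
  alignas(kCacheLineSize) std::size_t writeIdxCache_ = 0;

public:
  explicit SpscQueueSketch(std::size_t capacity)
      : capacity_(capacity + 1), slots_(capacity + 1) {}

  // Producer thread only.
  bool try_push(const T &v) {
    auto const writeIdx = writeIdx_.load(std::memory_order_relaxed);
    auto const nextWriteIdx = writeIdx + 1 == capacity_ ? 0 : writeIdx + 1;
    if (nextWriteIdx == readIdxCache_) {
      // Looks full according to our cached view; refresh from the shared index.
      readIdxCache_ = readIdx_.load(std::memory_order_acquire);
      if (nextWriteIdx == readIdxCache_) {
        return false; // really full
      }
    }
    slots_[writeIdx] = v;
    writeIdx_.store(nextWriteIdx, std::memory_order_release);
    return true;
  }

  // Consumer thread only.
  std::optional<T> try_pop() {
    auto const readIdx = readIdx_.load(std::memory_order_relaxed);
    if (readIdx == writeIdxCache_) {
      // Looks empty according to our cached view; refresh from the shared index.
      writeIdxCache_ = writeIdx_.load(std::memory_order_acquire);
      if (readIdx == writeIdxCache_) {
        return std::nullopt; // really empty
      }
    }
    T v = slots_[readIdx];
    readIdx_.store(readIdx + 1 == capacity_ ? 0 : readIdx + 1,
                   std::memory_order_release);
    return v;
  }
};

In the steady state, try_push only loads readIdx_ when its cached copy says the ring looks full, and try_pop only loads writeIdx_ when its cached copy says it looks empty, which is exactly the contention reduction described above.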
Upvotes: 3