I have an application that streams data using the movnt family of instructions for non-temporal write operations on regular (write-back) memory.
Then the data is handed off to a different thread for normal (write-back) reading.
The Intel reference manual volume 1 says that such weakly-ordered stores should be followed by an sfence or mfence instruction.
With coherent requests, a streaming store can be used in the same way as a regular store that has been mapped with a WC memory type (PAT or MTRR). An SFENCE instruction must be used within a producer-consumer usage model in order to ensure coherency and visibility of data between processors
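For reference, this is how I understand the producer-consumer rule quoted above, as a minimal sketch rather than my real code. The `Handoff` struct and the `produce`/`consume` functions are hypothetical names made up for this illustration; the only real APIs are the `_mm_stream_si32` and `_mm_sfence` intrinsics and `std::atomic`.

```cpp
#include <atomic>
#include <cstddef>
#include <immintrin.h> // _mm_stream_si32, _mm_sfence (x86 only)

// Hypothetical hand-off slot, for illustration only.
struct Handoff {
    std::atomic<bool> ready{false};
    int* data = nullptr;
    std::size_t count = 0;
};

// Producer: fill `dst` with non-temporal stores, then publish it.
inline void produce(Handoff& h, int* dst, std::size_t count, int value)
{
    for (std::size_t i = 0; i < count; ++i)
        _mm_stream_si32(dst + i, value + static_cast<int>(i)); // movnti
    _mm_sfence(); // order the NT stores before the flag store below
    h.data = dst;
    h.count = count;
    h.ready.store(true, std::memory_order_release); // ordinary store
}

// Consumer: ordinary write-back loads once the flag has been observed.
inline long long consume(const Handoff& h)
{
    while (!h.ready.load(std::memory_order_acquire)) { /* spin */ }
    long long sum = 0;
    for (std::size_t i = 0; i < h.count; ++i)
        sum += h.data[i];
    return sum;
}
```

In real use `produce` and `consume` would run on different threads; the release store by itself is a plain store on x86, which is why the explicit `_mm_sfence()` is there.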
What is not clear to me is whether non-temporal stores are also flushed by atomic read-modify-write instructions like lock cmpxchg, which I understand to also act as full memory barriers.
Volume 3 of the reference manual states:
For the P6 family processors, locked operations serialize all outstanding load and store operations (that is, wait for them to complete). This rule is also true for the Pentium 4 and Intel Xeon processors, with one exception. Load operations that reference weakly ordered memory types (such as the WC memory type) may not be serialized.
I interpret this to mean that a lock-prefixed operation should work instead of an sfence for the writer, since only weakly ordered loads are excluded.
And since the reading thread doesn't use WC memory or NT loads, it is not affected as long as its access is synchronized with that lock-prefixed operation of the writer.
Is that interpretation correct?
Does that mean that something like a mutex lock+unlock, which should always include some atomic read-modify-write, makes the sfence instruction superfluous?
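To make the question concrete, here is a sketch of the variant I am asking about: the explicit sfence is gone and the hand-off relies only on an atomic read-modify-write. Whether this is guaranteed to order the non-temporal stores is exactly what I want to know, so please read it as an illustration of the question and not as known-good code. `RmwHandoff`, `produce_rmw` and `consume_rmw` are hypothetical names; I am assuming that `exchange` compiles to an implicitly locked `xchg` on x86, which is what GCC and Clang typically emit.

```cpp
#include <atomic>
#include <cstddef>
#include <immintrin.h> // _mm_stream_si32 (x86 only)

// Hypothetical hand-off slot, for illustration only.
struct RmwHandoff {
    std::atomic<int> state{0}; // 0 = empty, 1 = published
    int* data = nullptr;
    std::size_t count = 0;
};

// Producer variant under question: no _mm_sfence, only an atomic RMW.
inline void produce_rmw(RmwHandoff& h, int* dst, std::size_t count, int value)
{
    for (std::size_t i = 0; i < count; ++i)
        _mm_stream_si32(dst + i, value); // movnti
    h.data = dst;
    h.count = count;
    // Assumption: this becomes a locked RMW instruction on x86.
    h.state.exchange(1, std::memory_order_seq_cst);
}

// Consumer: ordinary write-back loads once the state change is visible.
inline long long consume_rmw(const RmwHandoff& h)
{
    while (h.state.load(std::memory_order_acquire) != 1) { /* spin */ }
    long long sum = 0;
    for (std::size_t i = 0; i < h.count; ++i)
        sum += h.data[i];
    return sum;
}
```

A mutex unlock would play the same role as the `exchange` here only if its implementation uses a locked instruction, which is an assumption about the standard library.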
Conversely, using lock operations will also flush the line-fill buffers used for non-temporal stores.
Under pathological conditions, my code may have a mutex unlock + lock every 4 kiB of streaming writes (in a place which would not require an sfence).
Is this interpretation correct and will this have a negative effect on performance?
To illustrate that last point, here is an abridged version of the code in question:
#include <immintrin.h> // _mm_sfence
#include <mutex>
#include <utility>     // std::move

// MemoryBlock (not shown) is a movable buffer that is contextually
// convertible to bool.

class ThreadSafeQueue
{
public:
    void push(MemoryBlock&&);
    MemoryBlock pop();
};

class SimpleQueue
{
public:
    void push(MemoryBlock&&);
    MemoryBlock pop();
};

void thread(SimpleQueue& input, ThreadSafeQueue& output)
{
    std::mutex mutex;
    MemoryBlock tmp;
    while(MemoryBlock inblock = input.pop()) { // <-- no atomic or mutex
        mutex.lock(); // mutex for unrelated reasons (not shown)
        tmp.append(inblock); // append uses movnt, ca. 4 kiB at once
        if(tmp.full()) { // only once every few MiB
            _mm_sfence(); // <-- necessary if followed by mutex?
            output.push(std::move(tmp)); // <-- uses mutex
            tmp = MemoryBlock{};
        }
        mutex.unlock(); // <-- memory barrier acts as unnecessary sfence?
    }
}
In short: do weakly ordered stores with movnt interact with atomic RMW on x86 in a way that makes an sfence unnecessary if it is followed by a RMW? And does a RMW hurt performance when it sits between movnt instructions that would not normally have an sfence between them?
Related: Acquire/release semantics with non-temporal stores on x64