Does movnt interact with lock-prefixed instructions?

Question

I have an application that streams data using the movnt family of instructions for non-temporal write operations on regular (write-back) memory. Then the data is handed off to a different thread for normal (write-back) reading. The Intel reference manual volume 1 says that such weakly-ordered stores should be followed by an sfence or mfence instruction.

"With coherent requests, a streaming store can be used in the same way as a regular store that has been mapped with a WC memory type (PAT or MTRR). An SFENCE instruction must be used within a producer-consumer usage model in order to ensure coherency and visibility of data between processors."
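For reference, the producer-consumer pattern the manual describes can be sketched as follows. This is only a minimal illustration, assuming an x86 target with SSE2; `Channel`, `produce`, `consume` and `run_handoff` are hypothetical names and not part of my actual code. The writer streams the data, issues `_mm_sfence()`, and only then sets a flag that the reader polls with ordinary loads.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <emmintrin.h> // SSE2: _mm_stream_si32, _mm_sfence (x86 only)
#include <functional>
#include <thread>
#include <vector>

// Hypothetical hand-off structure, for illustration only.
struct Channel {
    std::vector<int> data;
    std::atomic<bool> ready{false};
    explicit Channel(std::size_t n) : data(n) {}
};

// Writer: fill the buffer with non-temporal stores, then publish it.
void produce(Channel& ch)
{
    for (std::size_t i = 0; i < ch.data.size(); ++i)
        _mm_stream_si32(&ch.data[i], static_cast<int>(i)); // movnti
    _mm_sfence(); // order the weakly ordered stores before the flag store
    ch.ready.store(true, std::memory_order_release);
}

// Reader: wait for the flag, then read the data with normal loads.
long long consume(Channel& ch)
{
    while (!ch.ready.load(std::memory_order_acquire))
        std::this_thread::yield();
    long long sum = 0;
    for (int v : ch.data)
        sum += v;
    return sum;
}

// Runs one hand-off between two threads and returns the reader's checksum.
long long run_handoff(std::size_t n)
{
    assert(n > 0);
    Channel ch(n);
    std::thread producer(produce, std::ref(ch));
    long long sum = consume(ch);
    producer.join();
    return sum;
}
```

Calling `run_handoff(1000)` should return the sum of 0..999, i.e. 499500, if the reader sees all streamed values.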

What is not clear to me is whether non-temporal stores are also flushed by atomic read-modify-write instructions such as lock cmpxchg, which I understand to also act as full memory barriers.

Volume 3 of the reference manual states:

"For the P6 family processors, locked operations serialize all outstanding load and store operations (that is, wait for them to complete). This rule is also true for the Pentium 4 and Intel Xeon processors, with one exception. Load operations that reference weakly ordered memory types (such as the WC memory type) may not be serialized."

I interpret this to mean that a lock-prefixed operation can take the place of the sfence in the writer, since only weakly ordered loads are excluded. And since the reading thread uses neither WC memory nor NT loads, it is not affected as long as its access is synchronized with the writer's lock-prefixed operation. Is that interpretation correct? Does that mean that something like a mutex lock+unlock, which should always include some atomic read-modify-write, makes the sfence instruction superfluous?
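To make the interpretation concrete, here is a sketch of the variant I am asking about. It is not a confirmed-correct pattern; whether it is sufficient is exactly the question. The names `Mailbox`, `publish_with_rmw`, `read_when_published` and `run_rmw_handoff` are hypothetical. The writer omits the sfence and publishes with an atomic exchange, which mainstream x86 compilers typically emit as xchg with a memory operand (implicitly locked). Note that the ISO C++ memory model does not describe non-temporal store intrinsics, so this relies purely on the behavior of the emitted instructions.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <emmintrin.h> // SSE2: _mm_stream_si32 (x86 only)
#include <functional>
#include <thread>
#include <vector>

// Hypothetical hand-off structure, for illustration only.
struct Mailbox {
    std::vector<int> data;
    std::atomic<int> state{0}; // 0 = empty, 1 = published
    explicit Mailbox(std::size_t n) : data(n) {}
};

// Writer: streams the data, then publishes with an atomic RMW and NO sfence.
void publish_with_rmw(Mailbox& box)
{
    for (std::size_t i = 0; i < box.data.size(); ++i)
        _mm_stream_si32(&box.data[i], static_cast<int>(2 * i)); // movnti
    // Deliberately no _mm_sfence() here. ASSUMPTION under question: the
    // locked RMW below also orders the preceding non-temporal stores.
    int previous = box.state.exchange(1, std::memory_order_seq_cst);
    assert(previous == 0);
    (void)previous; // silence unused warning when NDEBUG is defined
}

// Reader: ordinary acquire load on the flag, ordinary loads on the data.
long long read_when_published(Mailbox& box)
{
    while (box.state.load(std::memory_order_acquire) != 1)
        std::this_thread::yield();
    long long sum = 0;
    for (int v : box.data)
        sum += v;
    return sum;
}

// Runs one hand-off between two threads and returns the reader's checksum.
long long run_rmw_handoff(std::size_t n)
{
    Mailbox box(n);
    std::thread writer(publish_with_rmw, std::ref(box));
    long long sum = read_when_published(box);
    writer.join();
    return sum;
}
```

A passing run (for example `run_rmw_handoff(1000)` returning 999000) would of course not prove the ordering guarantee; it only shows the shape of the code in question.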

Conversely, I assume that lock-prefixed operations also flush the line-fill buffers used for non-temporal stores. Under pathological conditions, my code may perform a mutex unlock + lock after every 4 kiB of streaming writes (at a point that would not otherwise require an sfence). Is this assumption correct, and would it hurt performance?
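One way I could measure the performance side is a small microbenchmark along these lines. This is a rough sketch under my own assumptions: `stream_fill` is a hypothetical helper, and the atomic `fetch_add` merely stands in for the RMW inside a mutex lock or unlock (on x86 it is typically compiled to a lock-prefixed instruction).

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <emmintrin.h> // SSE2: _mm_stream_si32, _mm_sfence (x86 only)
#include <vector>

// Hypothetical helper: fills dst with non-temporal stores and performs one
// atomic RMW after every `rmw_every` ints (0 means never).
// Returns the elapsed time in nanoseconds.
long long stream_fill(std::vector<int>& dst, std::size_t rmw_every,
                      std::atomic<int>& counter)
{
    assert(!dst.empty());
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (std::size_t i = 0; i < dst.size(); ++i) {
        _mm_stream_si32(&dst[i], static_cast<int>(i)); // movnti
        if (rmw_every != 0 && (i + 1) % rmw_every == 0)
            counter.fetch_add(1, std::memory_order_seq_cst); // stand-in for mutex RMW
    }
    _mm_sfence(); // make sure all streamed data is globally visible before timing stops
    const auto stop = clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
}
```

Comparing `stream_fill(buf, 0, c)` against `stream_fill(buf, 1024, c)` (1024 ints = 4 kiB) on a buffer of several MiB, repeated a number of times, should show whether the interleaved RMW costs anything measurable on a given machine.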

To illustrate that last point, here is an abridged version of the code in question:

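#include <immintrin.h> // _mm_sfence
#include <mutex>
#include <utility>     // std::move

// MemoryBlock is not shown; its append() writes with movnt stores.

class ThreadSafeQueue
{
public:
    void push(MemoryBlock&&); // takes an internal mutex
    MemoryBlock pop();
};
class SimpleQueue
{
public:
    void push(MemoryBlock&&); // no synchronization
    MemoryBlock pop();
};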

void thread(SimpleQueue& input, ThreadSafeQueue& output)
{
    std::mutex mutex;
    MemoryBlock tmp;
    while(MemoryBlock inblock = input.pop()) { // <-- no atomic or mutex
        mutex.lock(); // mutex for unrelated reasons (not shown)
        tmp.append(inblock); // append uses movnt, ca. 4 kiB at once
        if(tmp.full()) { // only once every few MiB
            _mm_sfence(); // <-- necessary if followed by mutex?
            output.push(std::move(tmp)); // <-- uses mutex
            tmp = MemoryBlock{};
        }
        mutex.unlock(); // <-- memory barrier acts as unnecessary sfence?
    }
}

So, to summarize:

Do weakly ordered stores with movnt interact with atomic RMW on x86 in such a way that

1. we can avoid issuing an sfence if it is followed by an RMW?
2. we should avoid atomic RMW between movnt instructions that would not normally have an sfence between them?

Related: Acquire/release semantics with non-temporal stores on x64
