Noah

Reputation: 1759

Understanding factors in latency bounded memcpy/memset x86_64

I've been looking at some Stack Overflow posts (Why is std::fill(0) slower than std::fill(1)? and Enhanced REP MOVSB for memcpy), and one factor that determines the optimal memcpy/memset strategy appears to be whether the operation is latency bound or DRAM bandwidth bound. One of the points was that rep movsb has a longer handoff latency than normal writes, which I don't understand.

Why does ERMSB rep movsb have a longer handoff latency than a movaps (or any other normal write) loop for memcpy/memset?
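For concreteness, here is a minimal sketch of the two strategies I am comparing (my own illustration, not code from either post; it assumes 16-byte-aligned buffers and a size that is a multiple of 16 bytes):

    #include <stddef.h>
    #include <immintrin.h>

    /* ERMSB path: the microcode behind rep movsb can use a non-RFO
     * store protocol internally. */
    static void copy_rep_movsb(void *dst, const void *src, size_t n)
    {
        __asm__ volatile("rep movsb"
                         : "+D"(dst), "+S"(src), "+c"(n)
                         :
                         : "memory");
    }

    /* Explicit-loop path: each movaps store takes the normal RFO path,
     * i.e. the destination line is read for ownership before it is
     * written. Assumes 16-byte alignment and n % 16 == 0. */
    static void copy_movaps(float *dst, const float *src, size_t n)
    {
        for (size_t i = 0; i < n / 16; i++) {
            __m128 v = _mm_load_ps(src + 4 * i);  /* movaps load  */
            _mm_store_ps(dst + 4 * i, v);         /* movaps store */
        }
    }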

BeeOnRope wrote:

The behavior described above of rep movsb versus an explicit loop of movaps on a single core across various buffer sizes is pretty consistent with what we have seen before on server cores. As you point out, the competition is between a non-RFO protocol and the RFO (Read For Ownership) protocol. The former uses less bandwidth between all cache levels, but especially on server chips has a long latency handoff all the way to memory. Since a single core is generally concurrency limited, the latency matters, and the non-RFO protocol wins, which is what you see in the region beyond the 30 MB L3.
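To see why a single core is concurrency limited, Little's law gives a useful back-of-the-envelope (the numbers below are illustrative assumptions, not measurements): a core has only a handful of line fill buffers, so its sustainable bandwidth is roughly the buffer count times the line size, divided by how long each buffer stays occupied:

    bandwidth ~ (LFB count * 64 B) / handoff latency
              ~ (10 * 64 B) / 80 ns ~  8 GB/s   (handoff all the way to DRAM)
              ~ (10 * 64 B) / 40 ns ~ 16 GB/s   (handoff completing in LLC)

With the buffer count fixed, any reduction in handoff latency translates directly into higher single-core bandwidth, which is why the handoff latency, rather than raw DRAM bandwidth, decides the winner here.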

In Enhanced REP MOVSB for memcpy, however, BeeOnRope says

If you are concurrency limited, however, the situation equalizes and sometimes reverses. You have DRAM bandwidth to spare, so NT stores don't help, and they can even hurt since they may increase the latency: the handoff time for the line buffer may be longer than in a scenario where prefetch brings the RFO line into LLC (or even L2) and the store then completes in LLC for an effectively lower latency. Finally, server uncores tend to have much slower NT stores than client ones (and high bandwidth), which accentuates this effect.
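To make the two store paths in that paragraph concrete, here is a hedged sketch of a memset-style fill done both ways (my own illustration; the 512-byte prefetch distance is an assumed, untuned value, and dst is assumed 16-byte aligned with n a multiple of 4 floats):

    #include <stddef.h>
    #include <immintrin.h>

    /* NT-store fill: movntps bypasses the caches, so each line's
     * write-combining buffer must hand off all the way to memory. */
    static void fill_nt(float *dst, size_t n, float val)
    {
        __m128 v = _mm_set1_ps(val);
        for (size_t i = 0; i + 4 <= n; i += 4)
            _mm_stream_ps(dst + i, v);   /* movntps */
        _mm_sfence();                    /* order the NT stores */
    }

    /* Regular-store fill with software prefetch: if the prefetch has
     * already pulled the destination line into LLC/L2, the RFO is
     * satisfied there and the store completes with a shorter handoff. */
    static void fill_prefetch(float *dst, size_t n, float val)
    {
        __m128 v = _mm_set1_ps(val);
        for (size_t i = 0; i + 4 <= n; i += 4) {
            _mm_prefetch((const char *)(dst + i) + 512, _MM_HINT_T0);
            _mm_store_ps(dst + i, v);
        }
    }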

I am having trouble understanding how the non-RFO method (rep movsb) could have a longer handoff latency, given that the explanation for where the handoff latency comes from is whether the LFB (line fill buffer) has to hand off to a line cached in L2/LLC or all the way to DRAM.

The Enhanced REP MOVSB for memcpy post discusses the advantages of rep movsb, one of which is:

Issuing prefetch requests immediately and exactly. Hardware prefetching does a good job at detecting memcpy-like patterns, but it still takes a couple of reads to kick in and will "over-prefetch" many cache lines beyond the end of the copied region. rep movsb knows exactly the region size and can prefetch exactly.
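As a sketch of what "prefetch exactly" could look like in a software loop (my own approximation of what the microcode does internally; PF_DIST is an assumed, untuned distance):

    #include <stddef.h>
    #include <immintrin.h>

    #define PF_DIST 512  /* assumed prefetch distance, in bytes */

    /* Copy loop that, unlike a hardware prefetcher, never prefetches
     * past the end of the source region. Tail handling omitted. */
    static void copy_exact_prefetch(char *dst, const char *src, size_t n)
    {
        for (size_t i = 0; i + 64 <= n; i += 64) {
            if (i + PF_DIST < n)  /* stop exactly at the region end */
                _mm_prefetch(src + i + PF_DIST, _MM_HINT_T0);
            __m128i a = _mm_loadu_si128((const __m128i *)(src + i));
            __m128i b = _mm_loadu_si128((const __m128i *)(src + i + 16));
            __m128i c = _mm_loadu_si128((const __m128i *)(src + i + 32));
            __m128i d = _mm_loadu_si128((const __m128i *)(src + i + 48));
            _mm_storeu_si128((__m128i *)(dst + i), a);
            _mm_storeu_si128((__m128i *)(dst + i + 16), b);
            _mm_storeu_si128((__m128i *)(dst + i + 32), c);
            _mm_storeu_si128((__m128i *)(dst + i + 48), d);
        }
    }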

Given that rep movsb prefetches (more effectively than a movaps loop), wouldn't you expect a higher (or at least equal) likelihood that the LFB will be able to hand off to a line already in L2/LLC, compared to a movaps loop? If that is the case, I don't understand:

The former uses less bandwidth between all cache levels, but especially on server chips has a long latency handoff all the way to memory

particularly where the long latency handoff is coming from.

So my questions are:

  1. Where is the extra handoff latency on the LFB in rep movsb coming from?
  2. More generally, what contributes to latency bounds in rep movsb and memcpy/memset?

Upvotes: 2

Views: 285

Answers (0)
