Compiler choice of not using REP MOVSB instruction for a byte array move

Question

I'm checking the Release build of my project done with the latest version of the VS 2017 C++ compiler. And I'm curious why did compiler choose to build the following code snippet:

//ncbSzBuffDataUsed of type INT32

UINT8* pDst = (UINT8*)(pMXB + 1);
UINT8* pSrc = (UINT8*)pDPE;
for(size_t i = 0; i < (size_t)ncbSzBuffDataUsed; i++)
{
    pDst[i] = pSrc[i];
}

as such:

        UINT8* pDst = (UINT8*)(pMXB + 1);
        UINT8* pSrc = (UINT8*)pDPE;
        for(size_t i = 0; i < (size_t)ncbSzBuffDataUsed; i++)
00007FF66441251E 4C 63 C2             movsxd      r8,edx  
00007FF664412521 4C 2B D1             sub         r10,rcx  
00007FF664412524 0F 1F 40 00          nop         dword ptr [rax]  
00007FF664412528 0F 1F 84 00 00 00 00 00 nop         dword ptr [rax+rax]  

00007FF664412530 41 0F B6 04 0A       movzx       eax,byte ptr [r10+rcx]  
        {
            pDst[i] = pSrc[i];
00007FF664412535 88 01                mov         byte ptr [rcx],al  
00007FF664412537 48 8D 49 01          lea         rcx,[rcx+1]  
00007FF66441253B 49 83 E8 01          sub         r8,1  
00007FF66441253F 75 EF                jne         _logDebugPrint_in_MainXchgBuffer+0A0h (07FF664412530h)  
        }

versus just using a single REP MOVSB instruction? Wouldn't the latter be more efficient?

catnip · Accepted Answer

Edit: First up, there's an intrinsic for rep movsb which Peter Cordes tells us would be much faster here and I believe him (I guess I already did). If you want to force the compiler to do things this way, see: __movsb(): https://learn.microsoft.com/en-us/cpp/intrinsics/movsb.

As to why the compiler didn't do this for you, in the absence of any other ideas the answer might be register pressure. To use rep movsb The compiler would have to:

set up rsi (= source address)
set up rdi (= destination address)
set up rcx (= count)
issue the rep movsb

So now it has had to use up the three registers mandated by the rep movsb instruction, and it may prefer not to do that. Specifically rsi and rdi are expected to be preserved across a function call, so if the compiler can get away with using them in the body of any particular function it will, and (on initial entry to the method, at least) rcx holds the this pointer.

Also, with the code that we see the compiler has generated there, the r10 and rcxregisters might already contain the requisite source and destination addresses (we can't see that from your example), which would be handy for the compiler if so.

In practise, you will probably see the compiler make different choices in different situations. The type of optimisation requested (/O1 - optimise for size, vs /O2 - optimise for speed) will likely also affect this.

More on the x64 register passing convention here, and on the x64 ABI generally here.

Edit 2 (again inspired by Peter's comments):

The compiler probably decided not to vectorise the loop because it doesn't know if the pointers are aligned or might overlap. Without seeing more of the code, we can't be sure. But that's not strictly relevant to my answer, given what the OP actually asked about.

Compiler choice of not using REP MOVSB instruction for a byte array move

Answers (2)

Related Questions