c00000fd

Reputation: 22327

Compiler choice of not using REP MOVSB instruction for a byte array move

I'm checking the Release build of my project, built with the latest version of the VS 2017 C++ compiler, and I'm curious why the compiler chose to compile the following code snippet:

//ncbSzBuffDataUsed of type INT32

UINT8* pDst = (UINT8*)(pMXB + 1);
UINT8* pSrc = (UINT8*)pDPE;
for(size_t i = 0; i < (size_t)ncbSzBuffDataUsed; i++)
{
    pDst[i] = pSrc[i];
}

as such:


        UINT8* pDst = (UINT8*)(pMXB + 1);
        UINT8* pSrc = (UINT8*)pDPE;
        for(size_t i = 0; i < (size_t)ncbSzBuffDataUsed; i++)
00007FF66441251E 4C 63 C2             movsxd      r8,edx  
00007FF664412521 4C 2B D1             sub         r10,rcx  
00007FF664412524 0F 1F 40 00          nop         dword ptr [rax]  
00007FF664412528 0F 1F 84 00 00 00 00 00 nop         dword ptr [rax+rax]  

00007FF664412530 41 0F B6 04 0A       movzx       eax,byte ptr [r10+rcx]  
        {
            pDst[i] = pSrc[i];
00007FF664412535 88 01                mov         byte ptr [rcx],al  
00007FF664412537 48 8D 49 01          lea         rcx,[rcx+1]  
00007FF66441253B 49 83 E8 01          sub         r8,1  
00007FF66441253F 75 EF                jne         _logDebugPrint_in_MainXchgBuffer+0A0h (07FF664412530h)  
        }

versus just using a single REP MOVSB instruction? Wouldn't the latter be more efficient?

Upvotes: 4

Views: 1055

Answers (2)

catnip

Reputation: 25408

Edit: First up, there's an intrinsic for rep movsb, which Peter Cordes tells us would be much faster here, and I believe him (I guess I already did). If you want to force the compiler to do things this way, see __movsb(): https://learn.microsoft.com/en-us/cpp/intrinsics/movsb
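As a rough illustration of what calling the intrinsic could look like, here is a minimal sketch. The helper name copy_bytes_rep_movsb is hypothetical and not from the question. The non-MSVC branches (GCC/Clang inline assembly, and a plain memcpy fallback on other targets) are my own additions so the snippet builds outside Visual Studio. Whether this actually beats the compiler's byte loop on a given CPU is something to measure, not assume.

```cpp
#include <cstddef>
#include <cstring>

#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
#include <intrin.h>   // declares __movsb on MSVC for x86/x64
#endif

// Hypothetical helper: copies `count` bytes from src to dst using REP MOVSB
// where that is available. The two regions are assumed not to overlap.
inline void copy_bytes_rep_movsb(unsigned char* dst,
                                 const unsigned char* src,
                                 std::size_t count)
{
#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
    // MSVC intrinsic: emits REP MOVSB with rdi = dst, rsi = src, rcx = count.
    __movsb(dst, src, count);
#elif (defined(__GNUC__) || defined(__clang__)) && \
      (defined(__x86_64__) || defined(__i386__))
    // Assumption: equivalent inline assembly for GCC/Clang on x86.
    // "+D", "+S", "+c" bind dst, src, count to (r/e)di, (r/e)si, (r/e)cx.
    __asm__ __volatile__("rep movsb"
                         : "+D"(dst), "+S"(src), "+c"(count)
                         :
                         : "memory");
#else
    // Non-x86 fallback: no REP MOVSB here, just a library copy.
    std::memcpy(dst, src, count);
#endif
}
```

With the variables from the question, the loop would then become something like copy_bytes_rep_movsb(pDst, pSrc, (size_t)ncbSzBuffDataUsed).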

As to why the compiler didn't do this for you: in the absence of any other ideas, the answer might be register pressure. To use rep movsb, the compiler would have to:

  • set up rsi (= source address)
  • set up rdi (= destination address)
  • set up rcx (= count)
  • issue the rep movsb

So now it has had to use up the three registers mandated by the rep movsb instruction, and it may prefer not to do that. Specifically, rsi and rdi must be preserved across a function call (they are callee-saved in the Windows x64 calling convention), so a function that uses them has to save and restore them, and the compiler will avoid them if it can get away with it. Also, on initial entry to the method at least, rcx holds the this pointer.

Also, with the code that we see the compiler has generated there, the r10 and rcx registers might already contain the requisite source and destination addresses (we can't see that from your example), which would be handy for the compiler if so.

In practice, you will probably see the compiler make different choices in different situations. The type of optimisation requested (/O1 - optimise for size, vs /O2 - optimise for speed) will likely also affect this.

More on the x64 register passing convention here, and on the x64 ABI generally here.


Edit 2 (again inspired by Peter's comments):

The compiler probably decided not to vectorise the loop because it doesn't know if the pointers are aligned or might overlap. Without seeing more of the code, we can't be sure. But that's not strictly relevant to my answer, given what the OP actually asked about.
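If overlap really is what holds the compiler back, one common way to state the no-overlap assumption explicitly is to call memcpy, whose contract requires non-overlapping regions. The sketch below is only an illustration: the wrapper name copy_used_bytes is hypothetical, and the guard against a negative count is my own addition (the original loop casts the INT32 straight to size_t). What code the compiler emits for it (an inlined copy, a call into the runtime, or something else) is not something this answer has checked.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical wrapper around the copy from the question.
// memcpy requires that the source and destination do not overlap, so using it
// tells the compiler what the hand-written byte loop could not.
inline void copy_used_bytes(std::uint8_t* pDst,
                            const std::uint8_t* pSrc,
                            std::int32_t ncbSzBuffDataUsed)
{
    // Assumption: a negative count means "nothing to copy".
    if (ncbSzBuffDataUsed > 0)
    {
        std::memcpy(pDst, pSrc, static_cast<std::size_t>(ncbSzBuffDataUsed));
    }
}
```

If the buffers can overlap, memmove would be the safer call.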

Upvotes: 5

c00000fd

Reputation: 22327

This is not really an answer, and I can't jam it all into a comment. I just want to share my additional findings. (This is probably relevant to the Visual Studio compilers only.)

What also makes a difference is how you structure your loops. For instance:

Assuming the following struct definitions:

#define PCALLBACK ULONG64

#pragma pack(push)
#pragma pack(1)
typedef struct {
    ULONG64 ui0;

    USHORT w0;
    USHORT w1;

    //Followed by:
    //  PCALLBACK[] 'array' - variable size array
}DPE;
#pragma pack(pop)

(1) The regular way to structure a for loop. The following code chunk is called somewhere in the middle of a larger serialization function:

PCALLBACK* pDstClbks = (PCALLBACK*)(pDPE + 1);
for(size_t i = 0; i <  (size_t)info.wNumCallbackFuncs; i++)
{
    pDstClbks[i] = info.callbackFuncs[i];
}

As was mentioned in the answer on this page, the compiler was evidently starved of registers to have produced the following monstrosity (see how it reuses rax for the loop end limit, and the movzx eax,word ptr [r13] instruction that could clearly have been hoisted out of the loop):

    PCALLBACK* pDstClbks = (PCALLBACK*)(pDPE + 1);
00007FF7029327CF 48 83 C1 30          add         rcx,30h  
    for(size_t i = 0; i <  (size_t)info.wNumCallbackFuncs; i++)
00007FF7029327D3 66 41 3B 5D 00       cmp         bx,word ptr [r13]  
00007FF7029327D8 73 1F                jae         07FF7029327F9h
00007FF7029327DA 4C 8B C1             mov         r8,rcx  
00007FF7029327DD 4C 2B F1             sub         r14,rcx  
    {
        pDstClbks[i] = info.callbackFuncs[i];
00007FF7029327E0 4B 8B 44 06 08       mov         rax,qword ptr [r14+r8+8]  
00007FF7029327E5 48 FF C3             inc         rbx  
00007FF7029327E8 49 89 00             mov         qword ptr [r8],rax  
00007FF7029327EB 4D 8D 40 08          lea         r8,[r8+8]  
00007FF7029327EF 41 0F B7 45 00       movzx       eax,word ptr [r13]  
00007FF7029327F4 48 3B D8             cmp         rbx,rax  
00007FF7029327F7 72 E7                jb          07FF7029327E0h
    }
00007FF7029327F9 45 0F B7 C7          movzx       r8d,r15w  

(2) So if I re-write it into a less familiar C pattern:

PCALLBACK* pDstClbks = (PCALLBACK*)(pDPE + 1);
PCALLBACK* pEndDstClbks = pDstClbks + (size_t)info.wNumCallbackFuncs;
for(PCALLBACK* pScrClbks = info.callbackFuncs; 
    pDstClbks < pEndDstClbks; 
    pScrClbks++, pDstClbks++)
{
    *pDstClbks = *pScrClbks;
}

this produces more sensible machine code (on the same compiler, in the same function, in the same project):

    PCALLBACK* pDstClbks = (PCALLBACK*)(pDPE + 1);
00007FF71D7E27C2 48 83 C1 30          add         rcx,30h  
    PCALLBACK* pEndDstClbks = pDstClbks + (size_t)info.wNumCallbackFuncs;
00007FF71D7E27C6 0F B7 86 88 00 00 00 movzx       eax,word ptr [rsi+88h]  
00007FF71D7E27CD 48 8D 14 C1          lea         rdx,[rcx+rax*8]  
    for(PCALLBACK* pScrClbks = info.callbackFuncs; pDstClbks < pEndDstClbks; pScrClbks++, pDstClbks++)
00007FF71D7E27D1 48 3B CA             cmp         rcx,rdx  
00007FF71D7E27D4 76 14                jbe         07FF71D7E27EAh
00007FF71D7E27D6 48 2B F1             sub         rsi,rcx  
    {
        *pDstClbks = *pScrClbks;
00007FF71D7E27D9 48 8B 44 0E 08       mov         rax,qword ptr [rsi+rcx+8]  
00007FF71D7E27DE 48 89 01             mov         qword ptr [rcx],rax  
00007FF71D7E27E1 48 83 C1 08          add         rcx,8  
00007FF71D7E27E5 48 3B CA             cmp         rcx,rdx  
00007FF71D7E27E8 77 EF                jb          07FF71D7E27D9h
    }

00007FF71D7E27EA 45 0F B7 C6          movzx       r8d,r14w  
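A third variation that may be worth trying keeps the indexed loop from (1) but reads the count into a local first. In listing (1), the movzx eax,word ptr [r13] reloads info.wNumCallbackFuncs on every iteration. One plausible reason (an assumption, not something the listings prove) is that the compiler cannot rule out the stores through pDstClbks modifying the count. The sketch below uses a hypothetical CallbackInfo struct as a stand-in for the real type of info, which is not shown here, and it has not been checked against the VS 2017 code generator.

```cpp
#include <cstddef>
#include <cstdint>

typedef std::uint64_t PCALLBACK;   // stands in for "#define PCALLBACK ULONG64"

// Hypothetical stand-in for the type of `info`; the real layout is not shown.
struct CallbackInfo
{
    std::uint16_t wNumCallbackFuncs;
    PCALLBACK     callbackFuncs[8];
};

// Copies info.callbackFuncs into pDstClbks, which must have room for
// info.wNumCallbackFuncs elements.
inline void copy_callbacks(PCALLBACK* pDstClbks, const CallbackInfo& info)
{
    // Read the bound once, so the loop condition compares against a local
    // that the stores below cannot change.
    const std::size_t count = info.wNumCallbackFuncs;

    for (std::size_t i = 0; i < count; i++)
    {
        pDstClbks[i] = info.callbackFuncs[i];
    }
}
```

This keeps the familiar loop shape from (1) and leaves the end-pointer arithmetic of (2) to the compiler.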

Upvotes: 0
