Reputation: 1038
I'm currently doing micro-benchmarks to get a better understanding of CLR performance and version issues. The micro-benchmark in question XORs two byte arrays of 64 bytes each.
I always write a reference implementation in safe .NET before I try to beat the .NET Framework implementation with unsafe code and so on.
My reference implementation in question is:
for (int p = 0; p < 64; p++)
    a[p] ^= b[p];
where a and b are byte[64] arrays (byte[] a = new byte[64]) filled with data from the .NET RNG.
This code runs about twice as fast on x64 as on x86. At first I thought that was fine, because the JIT would turn it into something like *long ^= *long on x64 and *int ^= *int on x86.
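Just to illustrate that assumption (this is not what the JIT actually emits), the x86 case in my head corresponded to an int-wide unsafe loop along these lines:
// Illustration of my assumption only: XOR the 64 bytes as 16 ints.
fixed (byte* pA = a)
fixed (byte* pB = b)
{
    int* ipA = (int*)pA;
    int* ipB = (int*)pB;
    for (int p = 0; p < 16; p++)
    {
        *ipA ^= *ipB;
        ipA++;
        ipB++;
    }
}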
But my optimized unsafe-version:
fixed (byte* pA = a)
fixed (byte* pB = b)
{
    long* ppA = (long*)pA;
    long* ppB = (long*)pB;
    for (int p = 0; p < 8; p++)
    {
        *ppA ^= *ppB;
        ppA++;
        ppB++;
    }
}
runs about 4 times faster than the x64 reference implementation. So my assumption about the compiler's *long ^= *long and *int ^= *int optimization was wrong.
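A minimal Stopwatch harness along these lines is enough to reproduce the measurements (the iteration count and the XorSafe/XorUnsafe names are only illustrative placeholders for the two implementations above):
// Illustrative timing harness; XorSafe/XorUnsafe are placeholder names.
var sw = System.Diagnostics.Stopwatch.StartNew();
for (int i = 0; i < 10000000; i++)
    XorSafe(a, b);        // swap in XorUnsafe(a, b) for the pointer version
sw.Stop();
Console.WriteLine(sw.ElapsedMilliseconds);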
Where does this huge performance difference in the reference implementation come from? And now that I have posted the ASM code: why can't the C# compiler optimize the x86 version the same way?
IL code for x86 and x64 reference implementation (they are identical):
IL_0059: ldloc.3
IL_005a: ldloc.s p
IL_005c: ldelema [mscorlib]System.Byte
IL_0061: dup
IL_0062: ldobj [mscorlib]System.Byte
IL_0067: ldloc.s b
IL_0069: ldloc.s p
IL_006b: ldelem.u1
IL_006c: xor
IL_006d: conv.u1
IL_006e: stobj [mscorlib]System.Byte
IL_0073: ldloc.s p
IL_0075: ldc.i4.1
IL_0076: add
IL_0077: stloc.s p
IL_0079: ldloc.s p
IL_007b: ldc.i4.s 64
IL_007d: blt.s IL_0059
I think that ldloc.3 is the array a.
Resulting ASM code for x86:
for (int p = 0; p < 64; p++)
010900DF xor edx,edx
010900E1 mov edi,dword ptr [ebx+4]
a[p] ^= b[p];
010900E4 cmp edx,edi
010900E6 jae 0109010C
010900E8 lea esi,[ebx+edx+8]
010900EC mov eax,dword ptr [ebp-14h]
010900EF cmp edx,dword ptr [eax+4]
010900F2 jae 0109010C
010900F4 movzx eax,byte ptr [eax+edx+8]
010900F9 xor byte ptr [esi],al
for (int p = 0; p < 64; p++)
010900FB inc edx
010900FC cmp edx,40h
010900FF jl 010900E4
Resulting ASM code for x64:
a[p] ^= b[p];
00007FFF4A8B01C6 mov eax,3Eh
00007FFF4A8B01CB cmp rax,rcx
00007FFF4A8B01CE jae 00007FFF4A8B0245
00007FFF4A8B01D0 mov rax,qword ptr [rbx+8]
00007FFF4A8B01D4 mov r9d,3Eh
00007FFF4A8B01DA cmp r9,rax
00007FFF4A8B01DD jae 00007FFF4A8B0245
00007FFF4A8B01DF mov r9d,3Fh
00007FFF4A8B01E5 cmp r9,rcx
00007FFF4A8B01E8 jae 00007FFF4A8B0245
00007FFF4A8B01EA mov ecx,3Fh
00007FFF4A8B01EF cmp rcx,rax
00007FFF4A8B01F2 jae 00007FFF4A8B0245
00007FFF4A8B01F4 nop word ptr [rax+rax]
00007FFF4A8B0200 movzx ecx,byte ptr [rdi+rdx+10h]
00007FFF4A8B0205 movzx eax,byte ptr [rbx+rdx+10h]
00007FFF4A8B020A xor ecx,eax
00007FFF4A8B020C mov byte ptr [rdi+rdx+10h],cl
00007FFF4A8B0210 movzx ecx,byte ptr [rdi+rdx+11h]
00007FFF4A8B0215 movzx eax,byte ptr [rbx+rdx+11h]
00007FFF4A8B021A xor ecx,eax
00007FFF4A8B021C mov byte ptr [rdi+rdx+11h],cl
00007FFF4A8B0220 add rdx,2
for (int p = 0; p < 64; p++)
00007FFF4A8B0224 cmp rdx,40h
00007FFF4A8B0228 jl 00007FFF4A8B0200
Upvotes: 4
Views: 896
Reputation: 283684
You've made a classic mistake: attempting performance analysis on non-optimized code. Here is a complete, minimal, compilable example:
using System;

namespace SO30558357
{
    class Program
    {
        static void XorArray(byte[] a, byte[] b)
        {
            for (int p = 0; p < 64; p++)
                a[p] ^= b[p];
        }

        static void Main(string[] args)
        {
            byte[] a = new byte[64];
            byte[] b = new byte[64];
            Random r = new Random();
            r.NextBytes(a);
            r.NextBytes(b);
            XorArray(a, b);
            Console.ReadLine(); // when the program stops here,
                                // use Debug -> Attach to Process
        }
    }
}
I compiled that using Visual Studio 2013 Update 3, default "Release Build" settings for a C# console application except for the architecture, and ran it with CLR v4.0.30319. Oh I think I have Roslyn installed, but that shouldn't replace the JIT, only the translation to MSIL which is identical on both architectures.
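As an aside, if you want to be sure you are looking at the machine code of XorArray itself rather than an inlined copy, you can decorate it with MethodImplOptions.NoInlining; this is shown here only as an experimental aid, it is not part of the benchmark:
using System.Runtime.CompilerServices;

// Experimental aid only: keeps the JIT from inlining the method so its own
// machine code shows up in the Disassembly window.
[MethodImpl(MethodImplOptions.NoInlining)]
static void XorArray(byte[] a, byte[] b)
{
    for (int p = 0; p < 64; p++)
        a[p] ^= b[p];
}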
The actual x86 assembly for XorArray:
006F00D8 push ebp
006F00D9 mov ebp,esp
006F00DB push edi
006F00DC push esi
006F00DD push ebx
006F00DE push eax
006F00DF mov dword ptr [ebp-10h],edx
006F00E2 xor edi,edi
006F00E4 mov ebx,dword ptr [ecx+4]
006F00E7 cmp edi,ebx
006F00E9 jae 006F010F
006F00EB lea esi,[ecx+edi+8]
006F00EF movzx eax,byte ptr [esi]
006F00F2 mov edx,dword ptr [ebp-10h]
006F00F5 cmp edi,dword ptr [edx+4]
006F00F8 jae 006F010F
006F00FA movzx edx,byte ptr [edx+edi+8]
006F00FF xor eax,edx
006F0101 mov byte ptr [esi],al
006F0103 inc edi
006F0104 cmp edi,40h
006F0107 jl 006F00E7
006F0109 pop ecx
006F010A pop ebx
006F010B pop esi
006F010C pop edi
006F010D pop ebp
006F010E ret
And for x64:
00007FFD4A3000FB mov rax,qword ptr [rsi+8]
00007FFD4A3000FF mov rax,qword ptr [rbp+8]
00007FFD4A300103 nop word ptr [rax+rax]
00007FFD4A300110 movzx ecx,byte ptr [rsi+rdx+10h]
00007FFD4A300115 movzx eax,byte ptr [rdx+rbp+10h]
00007FFD4A30011A xor ecx,eax
00007FFD4A30011C mov byte ptr [rsi+rdx+10h],cl
00007FFD4A300120 movzx ecx,byte ptr [rsi+rdx+11h]
00007FFD4A300125 movzx eax,byte ptr [rdx+rbp+11h]
00007FFD4A30012A xor ecx,eax
00007FFD4A30012C mov byte ptr [rsi+rdx+11h],cl
00007FFD4A300130 movzx ecx,byte ptr [rsi+rdx+12h]
00007FFD4A300135 movzx eax,byte ptr [rdx+rbp+12h]
00007FFD4A30013A xor ecx,eax
00007FFD4A30013C mov byte ptr [rsi+rdx+12h],cl
00007FFD4A300140 movzx ecx,byte ptr [rsi+rdx+13h]
00007FFD4A300145 movzx eax,byte ptr [rdx+rbp+13h]
00007FFD4A30014A xor ecx,eax
00007FFD4A30014C mov byte ptr [rsi+rdx+13h],cl
00007FFD4A300150 add rdx,4
00007FFD4A300154 cmp rdx,40h
00007FFD4A300158 jl 00007FFD4A300110
Bottom line: The x64 optimizer worked a lot better. While it still uses byte-sized transfers, it unrolled the loop by a factor of 4 and inlined the function call.
Since loop-control logic accounts for roughly half the code in the x86 version, the unrolling alone can be expected to yield almost twice the performance.
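In source terms, the unrolled x64 loop corresponds roughly to this hand-unrolled sketch (illustrative only; the JIT does this at the machine-code level, not in C#):
// Rough C# equivalent of the unrolled x64 loop: four byte XORs per iteration.
for (int p = 0; p < 64; p += 4)
{
    a[p]     ^= b[p];
    a[p + 1] ^= b[p + 1];
    a[p + 2] ^= b[p + 2];
    a[p + 3] ^= b[p + 3];
}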
Inlining allowed the compiler to perform context-sensitive optimization, knowing the size of the arrays and eliminating the runtime bounds check.
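"Inlining by hand" just means moving the loop body into Main, roughly like this:
// Hand-inlined variant: the XOR loop sits directly in Main instead of XorArray.
static void Main(string[] args)
{
    byte[] a = new byte[64];
    byte[] b = new byte[64];
    Random r = new Random();
    r.NextBytes(a);
    r.NextBytes(b);

    for (int p = 0; p < 64; p++)
        a[p] ^= b[p];

    Console.ReadLine();
}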
If we inline by hand, the x86 compiler now yields:
00A000B1 xor edi,edi
00A000B3 mov eax,dword ptr [ebp-10h]
00A000B6 mov ebx,dword ptr [eax+4]
a[p] ^= b[p];
00A000B9 mov eax,dword ptr [ebp-10h]
00A000BC cmp edi,ebx
00A000BE jae 00A000F5
00A000C0 lea esi,[eax+edi+8]
00A000C4 movzx eax,byte ptr [esi]
00A000C7 mov edx,dword ptr [ebp-14h]
00A000CA cmp edi,dword ptr [edx+4]
00A000CD jae 00A000F5
00A000CF movzx edx,byte ptr [edx+edi+8]
00A000D4 xor eax,edx
00A000D6 mov byte ptr [esi],al
for (int p = 0; p< 64; p++)
00A000D8 inc edi
00A000D9 cmp edi,40h
00A000DC jl 00A000B9
That didn't help much: the loop still isn't unrolled and the runtime bounds checking is still there.
Notably, the x86 compiler found a register (EBX) in which to cache the length of one array, but it ran out of registers and had to load the other array's length from memory on every iteration. That should be a "cheap" L1 cache access, but it is still slower than a register access, and much slower than no bounds check at all.
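As a general note (not specific to these measurements), using the array's own Length as the loop bound usually lets the JIT drop the bounds check for that array, although the check on the second array normally remains:
// With a.Length as the bound, the JIT can prove a[p] is always in range and
// elide that check; b[p] is still range-checked.
static void XorArray(byte[] a, byte[] b)
{
    for (int p = 0; p < a.Length; p++)
        a[p] ^= b[p];
}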
Upvotes: 5