colinfang
colinfang

Reputation: 21727

How to explain the difference of performance in these 2 simple loops?

For those who are interested in how I do the benchmark, look here , I simple replace / add a couple of methods near the build in "Loop 1K" method.

Sorry, I forgot to say my testing environment. .Net 4.5 x64 (don't pick 32bit preferred). in x86 both methods take 5x as much as time.

Loop2 takes 3x as much time as Loop. I thought that x++ / x+=y should not slow down when x gets larger (since it takes 1 or 2 cpu instructions any way)

Is it due to Locality of reference? However I thought that in Loop2 there are not many variables, they should all be close to each other...

    public long Loop(long testSize)
    {
        long ret = 0;
        for (long i = 0; i < testSize; i++)
        {
            long p = 0;
            for (int j = 0; j < 1000; j++)
            {
                p+=10;
            }
            ret+=p;
        }
        return ret;
    }

    public long Loop2(long testSize)
    {
        long ret = 0;
        for (long i = 0; i < testSize; i++)
        {
            for (int j = 0; j < 1000; j++)
            {
                ret+=10;
            }
        }
        return ret;
    }

Update: When, if ever, is loop unrolling still useful? is useful

Upvotes: 9

Views: 476

Answers (7)

zoty314
zoty314

Reputation: 31

The outer loop is the same in both cases but that is what blocks the compiler to optimize the code in the second case.

The problem is that the variable ret is not declared close enough to the inner loop so it is not in the outer loop's body. The ret variable is outside the outer loop which means that it is out of scope for the compiler optimizer which can't optimize the code through 2 loops.

However the variable p is declared right before the inner loop that's why it is well optimized.

Upvotes: 0

Christopher Currens
Christopher Currens

Reputation: 30695

It's been said before a few times that the x86 JIT does a better job than the x64 JIT when it comes to optimization and it looks like that is what is happening in this case. Although the loops are performing essentially the same thing, the x64 assembly code that the JITer is creating are fundamentally different, and I think it accounts for the speed difference you're seeing.

The assembly code between the two methods differ in the critical inner loop, which is called 1000*N times. This is what I think is accounting for the speed difference.

Loop 1:

000007fe`97d50240 4d8bd1          mov     r10,r9 
000007fe`97d50243 4983c128        add     r9,28h 
000007fe`97d50247 4183c004        add     r8d,4  
; Loop while j < 1000d
000007fe`97d5024b 4181f8e8030000  cmp     r8d,3E8h
000007fe`97d50252 7cec            jl      000007fe`97d50240

Loop 2:


; rax = ret
; ecx = j

; Add 10 to ret 4 times
000007fe`97d50292 48050a000000    add     rax,0Ah
000007fe`97d50298 48050a000000    add     rax,0Ah
000007fe`97d5029e 48050a000000    add     rax,0Ah
000007fe`97d502a4 48050a000000    add     rax,0Ah
000007fe`97d502aa 83c104          add     ecx,4    ; increment j by 4

; Loop while j < 1000d
000007fe`97d502ad 81f9e8030000    cmp     ecx,3E8h
000007fe`97d502b3 7cdd            jl      000007fe`97d50292

You'll notice that the JIT is unrolling the inner loop, but the actual code in the loop differs greatly when it comes to the number of instructions made. Loop 1 is optimized to make a single add statement of 40, where Loop 2 makes 4 add statements of 10.

My (wild) guess is that the JITer can better optimize the variable p because it is defined in the inner scope of the first loop. Since it can detect that p is never used outside of that loop and is truly temporary, it can apply different optimizations. In the second loop, you're acting on a variable that is defined and used outside of the scope of both loops, and the optimization rules used in the x64 JIT doesn't recognize it as the same code that could have the same optimizations.

Upvotes: 6

Matthew Watson
Matthew Watson

Reputation: 109567

I can confirm this result on my system.

The results of my test are:

x64 Build

00:00:01.1490139 Loop
00:00:02.5043206 Loop2

x32 Build

00:00:04.1832937 Loop
00:00:04.2801726 Loop2

This is a RELEASE build run outside of the debugger.

using System;
using System.Diagnostics;

namespace Demo
{
    internal class Program
    {
        private static void Main()
        {
            new Program().test();
        }

        private void test()
        {
            Stopwatch sw = new Stopwatch();

            int count = 10000000;

            for (int i = 0; i < 5; ++i)
            {
                sw.Restart();
                Loop(count);
                Console.WriteLine(sw.Elapsed + " Loop");
                sw.Restart();
                Loop2(count);
                Console.WriteLine(sw.Elapsed + " Loop2");
                Console.WriteLine();
            }
        }


        public long Loop(long testSize)
        {
            long ret = 0;
            for (long i = 0; i < testSize; i++)
            {
                long p = 0;
                for (int j = 0; j < 1000; j++)
                {
                    p++;
                }
                ret += p;
            }
            return ret;
        }

        public long Loop2(long testSize)
        {
            long ret = 0;
            for (long i = 0; i < testSize; i++)
            {
                for (int j = 0; j < 1000; j++)
                {
                    ret++;
                }
            }
            return ret;
        }
    }
}

Upvotes: 1

Fredou
Fredou

Reputation: 20100

by looking at the IL itself, loop2 should be faster (and it is faster on my computer)

loop IL

.method public hidebysig 
instance int64 Loop (
    int64 testSize
) cil managed 
{
// Method begins at RVA 0x2054
// Code size 48 (0x30)
.maxstack 2
.locals init (
    [0] int64 'ret',
    [1] int64 i,
    [2] int64 p,
    [3] int32 j
)

IL_0000: ldc.i4.0
IL_0001: conv.i8
IL_0002: stloc.0
IL_0003: ldc.i4.0
IL_0004: conv.i8
IL_0005: stloc.1
IL_0006: br.s IL_002a
// loop start (head: IL_002a)
    IL_0008: ldc.i4.0
    IL_0009: conv.i8
    IL_000a: stloc.2
    IL_000b: ldc.i4.0
    IL_000c: stloc.3
    IL_000d: br.s IL_0019
    // loop start (head: IL_0019)
        IL_000f: ldloc.2
        IL_0010: ldc.i4.s 10
        IL_0012: conv.i8
        IL_0013: add
        IL_0014: stloc.2
        IL_0015: ldloc.3
        IL_0016: ldc.i4.1
        IL_0017: add
        IL_0018: stloc.3

        IL_0019: ldloc.3
        IL_001a: ldc.i4 1000
        IL_001f: blt.s IL_000f
    // end loop

    IL_0021: ldloc.0
    IL_0022: ldloc.2
    IL_0023: add
    IL_0024: stloc.0
    IL_0025: ldloc.1
    IL_0026: ldc.i4.1
    IL_0027: conv.i8
    IL_0028: add
    IL_0029: stloc.1

    IL_002a: ldloc.1
    IL_002b: ldarg.1
    IL_002c: blt.s IL_0008
// end loop

IL_002e: ldloc.0
IL_002f: ret
} // end of method Program::Loop

loop2 IL

.method public hidebysig 
instance int64 Loop2 (
    int64 testSize
) cil managed 
{
// Method begins at RVA 0x2090
// Code size 41 (0x29)
.maxstack 2
.locals init (
    [0] int64 'ret',
    [1] int64 i,
    [2] int32 j
)

IL_0000: ldc.i4.0
IL_0001: conv.i8
IL_0002: stloc.0
IL_0003: ldc.i4.0
IL_0004: conv.i8
IL_0005: stloc.1
IL_0006: br.s IL_0023
// loop start (head: IL_0023)
    IL_0008: ldc.i4.0
    IL_0009: stloc.2
    IL_000a: br.s IL_0016
    // loop start (head: IL_0016)
        IL_000c: ldloc.0
        IL_000d: ldc.i4.s 10
        IL_000f: conv.i8
        IL_0010: add
        IL_0011: stloc.0
        IL_0012: ldloc.2
        IL_0013: ldc.i4.1
        IL_0014: add
        IL_0015: stloc.2

        IL_0016: ldloc.2
        IL_0017: ldc.i4 1000
        IL_001c: blt.s IL_000c
    // end loop

    IL_001e: ldloc.1
    IL_001f: ldc.i4.1
    IL_0020: conv.i8
    IL_0021: add
    IL_0022: stloc.1

    IL_0023: ldloc.1
    IL_0024: ldarg.1
    IL_0025: blt.s IL_0008
// end loop

IL_0027: ldloc.0
IL_0028: ret
} // end of method Program::Loop2

Upvotes: 1

astef
astef

Reputation: 9488

I have run my own test and I don't see any significant difference. Try it:

using System;
using System.Diagnostics;

namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            Stopwatch sw = new Stopwatch();
            while (true)
            {
                sw.Start();
                Loop(5000000);
                sw.Stop();
                Console.WriteLine("Loop: {0}ms", sw.ElapsedMilliseconds);
                sw.Reset();

                sw.Start();
                Loop2(5000000);
                sw.Stop();
                Console.WriteLine("Loop2: {0}ms", sw.ElapsedMilliseconds);
                sw.Reset();

                Console.ReadLine();
            }
        }

        static long Loop(long testSize)
        {
            long ret = 0;
            for (long i = 0; i < testSize; i++)
            {
                long p = 0;
                for (int j = 0; j < 1000; j++)
                {
                    p++;
                }
                ret += p;
            }
            return ret;
        }

        static long Loop2(long testSize)
        {
            long ret = 0;
            for (long i = 0; i < testSize; i++)
            {
                for (int j = 0; j < 1000; j++)
                {
                    ret++;
                }
            }
            return ret;
        }
    }

}

So, my answer: reason is in your overcomplicated measurment system.

Upvotes: 0

Kirk Woll
Kirk Woll

Reputation: 77546

I am not seeing any appreciable difference in performance. Using this LinqPad script (and including those two methods of yours):

void Main()
{
    // Warmup the vm
    Loop(10);
    Loop2(10);

    var stopwatch = Stopwatch.StartNew();
    Loop(10 * 1000 * 1000);
    stopwatch.Stop();
    stopwatch.Elapsed.Dump();

    stopwatch = Stopwatch.StartNew();
    Loop2(10 * 1000 * 1000);
    stopwatch.Stop();
    stopwatch.Elapsed.Dump();
}

Prints out (in LinqPad);

00:00:22.7749976
00:00:22.6971114

When reversing the order of the Loop / Loop2 calls, the results are similar:

00:00:22.7572688
00:00:22.6758102

This seems to indicate that the performance is the same. Perhaps you didn't warm up the VM?

Upvotes: 2

Cheese
Cheese

Reputation: 11

Loop should be faster than Loop2, the only explanation that comes to my mind is that compiler optimizing kicks in and reduces the long p = 0; for (int j = 0; j < 1000; j++) { p++; } to somthing like long p = 1000;, checking the generated assembler code would bring clarity.

Upvotes: 1

Related Questions