Reputation: 21727
For those who are interested in how I do the benchmark, look here , I simple replace / add a couple of methods near the build in "Loop 1K" method.
Sorry, I forgot to say my testing environment. .Net 4.5 x64 (don't pick 32bit preferred). in x86 both methods take 5x as much as time.
Loop2
takes 3x as much time as Loop
. I thought that x++
/ x+=y
should not slow down when x
gets larger (since it takes 1 or 2 cpu instructions any way)
Is it due to Locality of reference? However I thought that in Loop2
there are not many variables, they should all be close to each other...
public long Loop(long testSize)
{
long ret = 0;
for (long i = 0; i < testSize; i++)
{
long p = 0;
for (int j = 0; j < 1000; j++)
{
p+=10;
}
ret+=p;
}
return ret;
}
public long Loop2(long testSize)
{
long ret = 0;
for (long i = 0; i < testSize; i++)
{
for (int j = 0; j < 1000; j++)
{
ret+=10;
}
}
return ret;
}
Update: When, if ever, is loop unrolling still useful? is useful
Upvotes: 9
Views: 476
Reputation: 31
The outer loop is the same in both cases but that is what blocks the compiler to optimize the code in the second case.
The problem is that the variable ret is not declared close enough to the inner loop so it is not in the outer loop's body. The ret variable is outside the outer loop which means that it is out of scope for the compiler optimizer which can't optimize the code through 2 loops.
However the variable p is declared right before the inner loop that's why it is well optimized.
Upvotes: 0
Reputation: 30695
It's been said before a few times that the x86 JIT does a better job than the x64 JIT when it comes to optimization and it looks like that is what is happening in this case. Although the loops are performing essentially the same thing, the x64 assembly code that the JITer is creating are fundamentally different, and I think it accounts for the speed difference you're seeing.
The assembly code between the two methods differ in the critical inner loop, which is called 1000*N times. This is what I think is accounting for the speed difference.
Loop 1:
000007fe`97d50240 4d8bd1 mov r10,r9 000007fe`97d50243 4983c128 add r9,28h 000007fe`97d50247 4183c004 add r8d,4 ; Loop while j < 1000d 000007fe`97d5024b 4181f8e8030000 cmp r8d,3E8h 000007fe`97d50252 7cec jl 000007fe`97d50240
Loop 2:
; rax = ret ; ecx = j ; Add 10 to ret 4 times 000007fe`97d50292 48050a000000 add rax,0Ah 000007fe`97d50298 48050a000000 add rax,0Ah 000007fe`97d5029e 48050a000000 add rax,0Ah 000007fe`97d502a4 48050a000000 add rax,0Ah 000007fe`97d502aa 83c104 add ecx,4 ; increment j by 4 ; Loop while j < 1000d 000007fe`97d502ad 81f9e8030000 cmp ecx,3E8h 000007fe`97d502b3 7cdd jl 000007fe`97d50292
You'll notice that the JIT is unrolling the inner loop, but the actual code in the loop differs greatly when it comes to the number of instructions made. Loop 1 is optimized to make a single add statement of 40, where Loop 2 makes 4 add statements of 10.
My (wild) guess is that the JITer can better optimize the variable p
because it is defined in the inner scope of the first loop. Since it can detect that p
is never used outside of that loop and is truly temporary, it can apply different optimizations. In the second loop, you're acting on a variable that is defined and used outside of the scope of both loops, and the optimization rules used in the x64 JIT doesn't recognize it as the same code that could have the same optimizations.
Upvotes: 6
Reputation: 109567
I can confirm this result on my system.
The results of my test are:
x64 Build
00:00:01.1490139 Loop
00:00:02.5043206 Loop2
x32 Build
00:00:04.1832937 Loop
00:00:04.2801726 Loop2
This is a RELEASE build run outside of the debugger.
using System;
using System.Diagnostics;
namespace Demo
{
internal class Program
{
private static void Main()
{
new Program().test();
}
private void test()
{
Stopwatch sw = new Stopwatch();
int count = 10000000;
for (int i = 0; i < 5; ++i)
{
sw.Restart();
Loop(count);
Console.WriteLine(sw.Elapsed + " Loop");
sw.Restart();
Loop2(count);
Console.WriteLine(sw.Elapsed + " Loop2");
Console.WriteLine();
}
}
public long Loop(long testSize)
{
long ret = 0;
for (long i = 0; i < testSize; i++)
{
long p = 0;
for (int j = 0; j < 1000; j++)
{
p++;
}
ret += p;
}
return ret;
}
public long Loop2(long testSize)
{
long ret = 0;
for (long i = 0; i < testSize; i++)
{
for (int j = 0; j < 1000; j++)
{
ret++;
}
}
return ret;
}
}
}
Upvotes: 1
Reputation: 20100
by looking at the IL itself, loop2 should be faster (and it is faster on my computer)
loop IL
.method public hidebysig
instance int64 Loop (
int64 testSize
) cil managed
{
// Method begins at RVA 0x2054
// Code size 48 (0x30)
.maxstack 2
.locals init (
[0] int64 'ret',
[1] int64 i,
[2] int64 p,
[3] int32 j
)
IL_0000: ldc.i4.0
IL_0001: conv.i8
IL_0002: stloc.0
IL_0003: ldc.i4.0
IL_0004: conv.i8
IL_0005: stloc.1
IL_0006: br.s IL_002a
// loop start (head: IL_002a)
IL_0008: ldc.i4.0
IL_0009: conv.i8
IL_000a: stloc.2
IL_000b: ldc.i4.0
IL_000c: stloc.3
IL_000d: br.s IL_0019
// loop start (head: IL_0019)
IL_000f: ldloc.2
IL_0010: ldc.i4.s 10
IL_0012: conv.i8
IL_0013: add
IL_0014: stloc.2
IL_0015: ldloc.3
IL_0016: ldc.i4.1
IL_0017: add
IL_0018: stloc.3
IL_0019: ldloc.3
IL_001a: ldc.i4 1000
IL_001f: blt.s IL_000f
// end loop
IL_0021: ldloc.0
IL_0022: ldloc.2
IL_0023: add
IL_0024: stloc.0
IL_0025: ldloc.1
IL_0026: ldc.i4.1
IL_0027: conv.i8
IL_0028: add
IL_0029: stloc.1
IL_002a: ldloc.1
IL_002b: ldarg.1
IL_002c: blt.s IL_0008
// end loop
IL_002e: ldloc.0
IL_002f: ret
} // end of method Program::Loop
loop2 IL
.method public hidebysig
instance int64 Loop2 (
int64 testSize
) cil managed
{
// Method begins at RVA 0x2090
// Code size 41 (0x29)
.maxstack 2
.locals init (
[0] int64 'ret',
[1] int64 i,
[2] int32 j
)
IL_0000: ldc.i4.0
IL_0001: conv.i8
IL_0002: stloc.0
IL_0003: ldc.i4.0
IL_0004: conv.i8
IL_0005: stloc.1
IL_0006: br.s IL_0023
// loop start (head: IL_0023)
IL_0008: ldc.i4.0
IL_0009: stloc.2
IL_000a: br.s IL_0016
// loop start (head: IL_0016)
IL_000c: ldloc.0
IL_000d: ldc.i4.s 10
IL_000f: conv.i8
IL_0010: add
IL_0011: stloc.0
IL_0012: ldloc.2
IL_0013: ldc.i4.1
IL_0014: add
IL_0015: stloc.2
IL_0016: ldloc.2
IL_0017: ldc.i4 1000
IL_001c: blt.s IL_000c
// end loop
IL_001e: ldloc.1
IL_001f: ldc.i4.1
IL_0020: conv.i8
IL_0021: add
IL_0022: stloc.1
IL_0023: ldloc.1
IL_0024: ldarg.1
IL_0025: blt.s IL_0008
// end loop
IL_0027: ldloc.0
IL_0028: ret
} // end of method Program::Loop2
Upvotes: 1
Reputation: 9488
I have run my own test and I don't see any significant difference. Try it:
using System;
using System.Diagnostics;
namespace ConsoleApplication1
{
class Program
{
static void Main(string[] args)
{
Stopwatch sw = new Stopwatch();
while (true)
{
sw.Start();
Loop(5000000);
sw.Stop();
Console.WriteLine("Loop: {0}ms", sw.ElapsedMilliseconds);
sw.Reset();
sw.Start();
Loop2(5000000);
sw.Stop();
Console.WriteLine("Loop2: {0}ms", sw.ElapsedMilliseconds);
sw.Reset();
Console.ReadLine();
}
}
static long Loop(long testSize)
{
long ret = 0;
for (long i = 0; i < testSize; i++)
{
long p = 0;
for (int j = 0; j < 1000; j++)
{
p++;
}
ret += p;
}
return ret;
}
static long Loop2(long testSize)
{
long ret = 0;
for (long i = 0; i < testSize; i++)
{
for (int j = 0; j < 1000; j++)
{
ret++;
}
}
return ret;
}
}
}
So, my answer: reason is in your overcomplicated measurment system.
Upvotes: 0
Reputation: 77546
I am not seeing any appreciable difference in performance. Using this LinqPad script (and including those two methods of yours):
void Main()
{
// Warmup the vm
Loop(10);
Loop2(10);
var stopwatch = Stopwatch.StartNew();
Loop(10 * 1000 * 1000);
stopwatch.Stop();
stopwatch.Elapsed.Dump();
stopwatch = Stopwatch.StartNew();
Loop2(10 * 1000 * 1000);
stopwatch.Stop();
stopwatch.Elapsed.Dump();
}
Prints out (in LinqPad);
00:00:22.7749976
00:00:22.6971114
When reversing the order of the Loop
/ Loop2
calls, the results are similar:
00:00:22.7572688
00:00:22.6758102
This seems to indicate that the performance is the same. Perhaps you didn't warm up the VM?
Upvotes: 2
Reputation: 11
Loop should be faster than Loop2, the only explanation that comes to my mind is that compiler optimizing kicks in and reduces the long p = 0;
for (int j = 0; j < 1000; j++)
{
p++;
}
to somthing like long p = 1000;
, checking the generated assembler code would bring clarity.
Upvotes: 1