Reputation: 7423
I was investigating some performance issues in an Event Viewer application on our development site when I noticed something interesting in the algorithm. I then created a simplified test project to compare two different algorithms. The program retrieves Windows Event Logs using the EventLog
class, and then translates those logs into queryable EventLogItem
entities.
This operation is performed and timed using two different loops. The first (backward) loop starts at the index of the last item in the list, translates the item and then decreases the index. The method is defined like this:
private static void TranslateLogsUsingBackwardLoop()
{
    Stopwatch stopwatch = new Stopwatch();
    stopwatch.Start();
    var originalLogs = EventLog.GetEventLogs();
    var translatedLogs = new List<EventLogItem>();
    Parallel.ForEach<EventLog>(originalLogs, currentLog =>
    {
        for (int index = currentLog.Entries.Count - 1; index >= 0; index--)
        {
            var currentEntry = currentLog.Entries[index];
            EventLogItem translatedEntry = new EventLogItem
            {
                MachineName = currentEntry.MachineName,
                LogName = currentLog.LogDisplayName,
                CreatedTime = currentEntry.TimeGenerated,
                Source = currentEntry.Source,
                Message = currentEntry.Message,
                Number = currentEntry.Index,
                Category = currentEntry.Category,
                Type = currentEntry.EntryType,
                InstanceID = currentEntry.InstanceId,
                User = currentEntry.UserName,
            };
            lock (translatedLogs)
            {
                translatedLogs.Add(translatedEntry);
            }
        }
    });
    stopwatch.Stop();
    Console.WriteLine("{0} logs were translated in {1} using backward loop.", translatedLogs.Count, stopwatch.Elapsed);
}
The second (forward) loop starts at index 0 and increments the index. This method is defined like this:
private static void TranslateLogsUsingForwardLoop()
{
    Stopwatch stopwatch = new Stopwatch();
    stopwatch.Start();
    var originalLogs = EventLog.GetEventLogs();
    var translatedLogs = new List<EventLogItem>();
    Parallel.ForEach<EventLog>(originalLogs, currentLog =>
    {
        for (int index = 0; index < currentLog.Entries.Count; index++)
        {
            var currentEntry = currentLog.Entries[index];
            EventLogItem translatedEntry = new EventLogItem
            {
                MachineName = currentEntry.MachineName,
                LogName = currentLog.LogDisplayName,
                CreatedTime = currentEntry.TimeGenerated,
                Source = currentEntry.Source,
                Message = currentEntry.Message,
                Number = currentEntry.Index,
                Category = currentEntry.Category,
                Type = currentEntry.EntryType,
                InstanceID = currentEntry.InstanceId,
                User = currentEntry.UserName,
            };
            lock (translatedLogs)
            {
                translatedLogs.Add(translatedEntry);
            }
        }
    });
    stopwatch.Stop();
    Console.WriteLine("{0} logs were translated in {1} using forward loop.", translatedLogs.Count, stopwatch.Elapsed);
}
And the main method:
static void Main(string[] args)
{
    TranslateLogsUsingForwardLoop();
    Console.WriteLine();
    Thread.Sleep(2000);
    TranslateLogsUsingBackwardLoop();
    Console.ReadLine();
}
This is what I get (I performed this test several times, and the results are almost identical):
Note that the server I tested this on logs to the Event Log every second, which is why the numbers of translated logs are not the same. So why is the backward loop faster? I initially thought it was because in the backward loop algorithm, currentLog.Entries.Count
is evaluated just once, whereas in the forward loop it has to be evaluated and compared against index
on every iteration, but then again that doesn't seem right. Any ideas?
Upvotes: 3
Views: 2154
Reputation: 2180
There is substantially zero difference between forward and backward iteration of List<T>
and T[]
. However, there is a substantial difference between caching and not caching the loop boundary, and backward iteration automatically, effectively caches the loop boundary (it takes unnatural effort to avoid this).
I've included a benchmark and its results. It is evident that the *BoundaryCached*
tests are equally fast, while the *BoundaryUncached*
tests are markedly slower. This result is consistent for sizes of 1, 5, 10, 100, 1,000, and 800,000. My CPU has a 512 KiB L1 cache, in which 800,000 32-bit integers will not fit.
BenchmarkDotNet=v0.13.1, OS=opensuse-leap 15.3
AMD Ryzen 7 3700X, 1 CPU, 16 logical and 8 physical cores
.NET SDK=6.0.202
[Host] : .NET 6.0.4 (6.0.422.16404), X64 RyuJIT
DefaultJob : .NET 6.0.4 (6.0.422.16404), X64 RyuJIT
Method | items | Mean | Error | StdDev | Ratio |
---|---|---|---|---|---|
For_List_BoundaryUncached_Forwards | 10 | 37.1188 ns | 0.1784 ns | 0.1582 ns | 1.00 |
For_List_BoundaryCached_Forwards | 10 | 21.4687 ns | 0.0858 ns | 0.0717 ns | 0.58 |
For_List_BoundaryCached_Backwards | 10 | 21.6920 ns | 0.2395 ns | 0.2241 ns | 0.58 |
For_Array_BoundaryUncached_Forwards | 10 | 5.6312 ns | 0.0164 ns | 0.0145 ns | 0.15 |
For_Array_BoundaryCached_Forwards | 10 | 4.2638 ns | 0.0672 ns | 0.0629 ns | 0.11 |
For_Array_BoundaryCached_Backwards | 10 | 4.6106 ns | 0.0338 ns | 0.0316 ns | 0.12 |
For_List_BoundaryUncached_Forwards | 1000 | 3,602.5782 ns | 11.2947 ns | 10.5650 ns | 1.00 |
For_List_BoundaryCached_Forwards | 1000 | 1,922.8651 ns | 14.5266 ns | 13.5882 ns | 0.53 |
For_List_BoundaryCached_Backwards | 1000 | 1,911.4424 ns | 8.7750 ns | 8.2081 ns | 0.53 |
For_Array_BoundaryUncached_Forwards | 1000 | 469.6596 ns | 0.8472 ns | 0.6614 ns | 0.13 |
For_Array_BoundaryCached_Forwards | 1000 | 469.5290 ns | 0.5630 ns | 0.4991 ns | 0.13 |
For_Array_BoundaryCached_Backwards | 1000 | 462.3722 ns | 1.1469 ns | 1.0728 ns | 0.13 |
For_List_BoundaryUncached_Forwards | 800000 | 2,841,438.8663 ns | 19,331.9630 ns | 16,143.0646 ns | 1.00 |
For_List_BoundaryCached_Forwards | 800000 | 1,525,126.0399 ns | 8,127.2104 ns | 7,204.5603 ns | 0.54 |
For_List_BoundaryCached_Backwards | 800000 | 1,544,717.2824 ns | 5,226.2047 ns | 4,888.5952 ns | 0.54 |
For_Array_BoundaryUncached_Forwards | 800000 | 373,252.2978 ns | 1,450.3420 ns | 1,356.6508 ns | 0.13 |
For_Array_BoundaryCached_Forwards | 800000 | 373,244.7567 ns | 933.6585 ns | 827.6639 ns | 0.13 |
For_Array_BoundaryCached_Backwards | 800000 | 373,297.0415 ns | 705.1777 ns | 659.6236 ns | 0.13 |
// MIT or Apache-2.0
using BenchmarkDotNet.Attributes;
using System.Diagnostics;

public class LoopForwardsBackwards
{
    private IList<int>? valuesList;
    private int[]? valuesArray;

    [Params(10, 1000, 800000)]
    public int items;

    [GlobalSetup]
    public void GlobalSetup()
    {
        var numbers = Enumerable.Range(0, items).ToList();
        valuesList = numbers;
        valuesArray = valuesList.ToArray();
    }

    [Benchmark(Baseline = true)]
    public int For_List_BoundaryUncached_Forwards()
    {
        Debug.Assert(valuesList is not null);
        int n = 0;
        for (int i = 0; i < valuesList.Count; ++i)
            n += valuesList[i];
        return n;
    }

    [Benchmark]
    public int For_List_BoundaryCached_Forwards()
    {
        Debug.Assert(valuesList is not null);
        int n = 0;
        int len = valuesList.Count;
        for (int i = 0; i < len; ++i)
            n += valuesList[i];
        return n;
    }

    [Benchmark]
    public int For_List_BoundaryCached_Backwards()
    {
        Debug.Assert(valuesList is not null);
        int n = 0;
        for (int i = valuesList.Count - 1; i >= 0; --i)
            n += valuesList[i];
        return n;
    }

    [Benchmark]
    public int For_Array_BoundaryUncached_Forwards()
    {
        Debug.Assert(valuesArray is not null);
        int n = 0;
        for (int i = 0; i < valuesArray.Length; ++i)
            n += valuesArray[i];
        return n;
    }

    [Benchmark]
    public int For_Array_BoundaryCached_Forwards()
    {
        Debug.Assert(valuesArray is not null);
        int n = 0;
        int len = valuesArray.Length;
        for (int i = 0; i < len; ++i)
            n += valuesArray[i];
        return n;
    }

    [Benchmark]
    public int For_Array_BoundaryCached_Backwards()
    {
        Debug.Assert(valuesArray is not null);
        int n = 0;
        for (int i = valuesArray.Length - 1; i >= 0; --i)
            n += valuesArray[i];
        return n;
    }
}
Upvotes: 1
Reputation:
The first loop is slower because it's first, not because it's forwards.
Caching
Modern CPUs cache data (in Level 1 and Level 2 caches). This is slow the first time the data gets accessed, then faster for subsequent accesses.
var currentEntry = currentLog.Entries[index];
The first loop will take longer, because it's loading from slow RAM into the L2 cache.
I would expect the second loop to be faster, regardless of how it's written, because it's loading from the L2 cache.
List<T>
Lists are ever-expanding arrays. They start off small (with a capacity of 4) and then double their capacity as required. Each reallocation is very slow.
var translatedLogs = new List<EventLogItem>();
...
translatedLogs.Add(translatedEntry);
The first loop will reallocate fairly often: 4, 8, 16, 32, 64
The second loop will reallocate less often: 64, 128
So you would expect the second loop (regardless of how it's written) to be faster.
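The doubling can be observed directly through List<T>.Capacity. A minimal sketch (the exact growth sequence is a runtime implementation detail, but current .NET grows 4, 8, 16, ...):

```csharp
using System;
using System.Collections.Generic;

class CapacityDemo
{
    static void Main()
    {
        var list = new List<int>();
        int lastCapacity = -1;
        for (int i = 0; i < 100; i++)
        {
            list.Add(i);
            if (list.Capacity != lastCapacity)   // capacity changed => a reallocation happened
            {
                lastCapacity = list.Capacity;
                Console.WriteLine($"Count={list.Count}, Capacity={list.Capacity}");
            }
        }
    }
}
```

If you can estimate the final size up front, pre-sizing with `new List<EventLogItem>(estimatedCount)` avoids the reallocations entirely.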
CPU optimizations
Weird things happen because processors are so complex. You can no longer predict code speed the way we used to in the old days :-)
Why is processing a sorted array faster than an unsorted array?
Upvotes: 0
Reputation: 327
This is an old question, and this may not be the exact reason in this case, but there is a difference when the loops get down to the IL, assembly, or whatever your language's lower-level representation happens to be. At a minimum, with the normal forward loop you fetch the count and then compare your index variable against it on every iteration. In a reverse loop you fetch the count once as your starting point, and the comparison is always against 0, which is cheaper to compare and which compilers can even optimize for. Your mileage may vary, though, and depending on the rest of the code the difference may be negligible; but if you need every clock cycle, reverse loops are great.
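The difference described above can be seen in a stripped-down pair of loops (a sketch; whether the JIT actually emits a test-against-zero, or hoists the length read, depends on the runtime and on what it can prove about the loop):

```csharp
using System;

class LoopDirectionSketch
{
    static int SumForwards(int[] data)
    {
        int n = 0;
        // Boundary is read and compared on every iteration unless the JIT hoists it.
        for (int i = 0; i < data.Length; i++)
            n += data[i];
        return n;
    }

    static int SumBackwards(int[] data)
    {
        int n = 0;
        // Boundary is read once; each iteration compares against the constant 0.
        for (int i = data.Length - 1; i >= 0; i--)
            n += data[i];
        return n;
    }

    static void Main()
    {
        var data = new[] { 1, 2, 3, 4 };
        Console.WriteLine(SumForwards(data));   // 10
        Console.WriteLine(SumBackwards(data));  // 10
    }
}
```

One caveat: for arrays specifically, the canonical forward pattern `i < data.Length` is the shape the .NET JIT recognizes for bounds-check elimination, so the backward loop is not automatically a win there.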
Upvotes: 2
Reputation: 9134
Testing against 0 versus the max index is likely to have little effect. However, performing test 1 and then test 2 shortly afterwards often has an effect due to processor caching and/or OS page caching. You might reverse the order of test 1 and test 2 to see if forwards magically becomes faster than backwards. Accurate profiling is hard on modern architectures.
OK, so backwards is still faster when executed first. That was not my first guess, but since you are using Parallel and lock, there is perhaps an interaction between the locking method and the difference between forward and backward looping.
Maybe the backward loop just happens to work better with processor branch prediction (which again may interact with parallelism, the processor cache, etc.).
Lots of tight loops in multi-threaded code have weird interactions with memory management because of locking overhead. It is not even uncommon for a multi-threaded solution to be slower because of lock contention.
You can try running both forward and backward without Parallel to see if the timings become more even, but at best that would only tell you whether the difference is likely or unlikely to be related to parallel interactions or lock contention. Profiling your code may be suggestive, but it may not provide a definitive answer either. A definitive answer can be quite difficult for this situation (I assume you are mostly in curious/learning mode).
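One way to test the lock-contention theory is to take the per-entry lock off the hot path, using the Parallel.ForEach overload that takes localInit/localFinally delegates: each worker accumulates into its own list and the shared list is locked once per worker. A self-contained sketch with plain integers standing in for the question's EventLogItem translation:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

class LocalListDemo
{
    static void Main()
    {
        // 8 "logs", each with 100 "entries" to translate.
        var sources = Enumerable.Range(0, 8)
                                .Select(i => Enumerable.Range(i * 100, 100).ToArray());
        var results = new List<int>();

        Parallel.ForEach(
            sources,
            () => new List<int>(),                   // localInit: one list per worker
            (source, state, localList) =>
            {
                foreach (var item in source)
                    localList.Add(item);             // no lock on the hot path
                return localList;
            },
            localList =>
            {
                lock (results)                       // localFinally: one lock per worker
                    results.AddRange(localList);
            });

        Console.WriteLine(results.Count);            // 800
    }
}
```

If the forward/backward gap shrinks under this version, lock contention was a plausible contributor; if it persists, the cause is more likely in the loops themselves.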
Upvotes: 0