Reputation: 2119
I expect the following program to be completely memory-bound in terms of performance (the arrays are way bigger than the L3-cache).
Therefor I expected the sum of the long array to take almost twice as much time as the sum of the int array.
But they take both almost the same time:
int sum took 81 ms, result = 4999999950000000
long sum took 87 ms, result = 4999999950000000
Can anybody explain this?
using System;
using System.Diagnostics;
using System.Linq;
namespace MemoryPerformance
{
class Program
{
static void Main(string[] args)
{
const int count = 100_000_000;
int[] intArray = Enumerable.Range(0, count).ToArray();
long[] longArray = intArray.Select(x => (long)x).ToArray();
Measure(() => intSum(intArray), " int sum");
Measure(() => longSum(longArray), "long sum");
}
static long intSum(int[] array)
{
long sum = 0;
for (int i = 0; i < array.Length; i++) sum += array[i];
return sum;
}
static long longSum(long[] array)
{
long sum = 0;
for (int i = 0; i < array.Length; i++) sum += array[i];
return sum;
}
static void Measure(Func<long> calc, string description)
{
Stopwatch stopwatch = new Stopwatch();
stopwatch.Start();
long sum = calc();
stopwatch.Stop();
Console.WriteLine($"{description} took {stopwatch.ElapsedMilliseconds} ms, result = {sum}");
}
}
}
Upvotes: 2
Views: 242
Reputation: 8005
The time you are measuring is mainly "just the CPU time". If you only sum up the numbers and omit the whole memory access like in this fork of Harolds answer, you will see that it takes almost the same amount of time to simply add up all the numbers in a loop, without reading them from the array/memory:
static long noSum(long[] array)
{
long sum = 0;
for (int i = 0; i < array.Length; i ++) sum += i;
return sum;
}
This means that even though the CPU has to fetch the data from the memory and cannot keep it all in the cache, it can do this very efficiently because you are not using random array access: In the case of your loops it has plenty of time to prefetch the next cache line while it is still performing the computations. This results in almost no waiting time (speculative execution anyone?! ;-) ). Hence it doesn't matter in your case. It does in those cases where you need to access a lot of memory more quickly like in Harolds "sparse" test-case, obviously.
Upvotes: 3
Reputation: 64904
If I run this several times I get about the same results, but worse: (printing of the result added just in case)
int sum took 84 ms (4999999950000000)
long sum took 77 ms (4999999950000000)
int sum took 84 ms (4999999950000000)
long sum took 76 ms (4999999950000000)
int sum took 84 ms (4999999950000000)
long sum took 77 ms (4999999950000000)
int sum took 83 ms (4999999950000000)
long sum took 76 ms (4999999950000000)
So with longs it's faster? One this that it is not doing that the int
version is, is sign extension. It could be that. Actually I don't know what else it could be.
But this is just what happens when all elements of the array are added. If I take only every 8th element (8th because cache lines are 64 bytes and longs are 8, so 8 fit in a cache line), this happens:
int sum took 25 ms (624999950000000)
long sum took 49 ms (624999950000000)
int sum took 23 ms (624999950000000)
long sum took 49 ms (624999950000000)
int sum took 23 ms (624999950000000)
long sum took 48 ms (624999950000000)
int sum took 23 ms (624999950000000)
long sum took 48 ms (624999950000000)
This is very different, with indeed the int
version being about twice as fast as the long
version, corresponding with the number of cache misses expected for both versions.
So I may conclude from this that in the "full" version apparently enough arithmetic (or at least "stuff that isn't the memory access", including loop overhead) happens to hide most of the cache miss penalty, combined with actually doing less work in the long
version.
Also I think we should keep in mind that since this is a completely linear access pattern, it should be expected that hardware prefetching does a good job. The throughput of automatic prefetching may or may not be adequate, but it shouldn't be that bad of a case - it is not unreasonable that doing a bit of computation allows prefetching to "catch up".
Relevant code that uses only every 8th element:
static long intSum(int[] array)
{
long sum = 0;
for (int i = 0; i < array.Length; i += 8) sum += array[i];
return sum;
}
static long longSum(long[] array)
{
long sum = 0;
for (int i = 0; i < array.Length; i += 8) sum += array[i];
return sum;
}
Timing on ideone
Upvotes: 2