Reputation: 1061
Just playing around with some C# code, I found that the time taken to scan an in-memory array depends on the object size. Let me explain: for two collections of the same length but with different object sizes, looping over the collection takes longer for the bigger objects.
Testing with LINQPad:
Looping over all SmallObject objects takes ~221 ms.
Looping over all BigObject objects takes ~756 ms.
Why is the time not close to constant? Shouldn't it be using some kind of pointer arithmetic?
Thanks
public class SmallObject {
    public int JustAnInt0;

    public static SmallObject[] FakeList(int size) {
        var res = new SmallObject[size];
        for (var c = 0; c != size; ++c)
            res[c] = new SmallObject();
        return res;
    }
}
public class MediumObject {
    public int JustAnInt0;
    public int JustAnInt1;
    public int JustAnInt2;
    public int JustAnInt3;
    public int JustAnInt4;

    public static MediumObject[] FakeList(int size) {
        var res = new MediumObject[size];
        for (var c = 0; c != size; ++c)
            res[c] = new MediumObject();
        return res;
    }
}
public class BigObject {
    public int JustAnInt0;
    public int JustAnInt1;
    public int JustAnInt2;
    public int JustAnInt3;
    public int JustAnInt4;
    public int JustAnInt5;
    public int JustAnInt6;
    public int JustAnInt7;
    public int JustAnInt8;
    public int JustAnInt9;
    public int JustAnInt10;
    public int JustAnInt11;
    public int JustAnInt12;
    public int JustAnInt13;
    public int JustAnInt14;
    public int JustAnInt15;
    public int JustAnInt16;
    public int JustAnInt17;
    public int JustAnInt18;
    public int JustAnInt19;

    public static BigObject[] FakeList(int size) {
        var res = new BigObject[size];
        for (var c = 0; c != size; ++c)
            res[c] = new BigObject();
        return res;
    }
}
void Main()
{
    var size = 30000000;
    var small = SmallObject.FakeList(size);
    var medium = MediumObject.FakeList(size);
    var big = BigObject.FakeList(size);

    var sw = System.Diagnostics.Stopwatch.StartNew();
    for (var c = 0; c != size; ++c) {
        small[c].JustAnInt0++;
    }
    string.Format("Scan small list took {0}", sw.ElapsedMilliseconds).Dump();

    sw.Restart();
    for (var c = 0; c != size; ++c) {
        medium[c].JustAnInt0++;
    }
    string.Format("Scan medium list took {0}", sw.ElapsedMilliseconds).Dump();

    sw.Restart();
    for (var c = 0; c != size; ++c) {
        big[c].JustAnInt0++;
    }
    string.Format("Scan big list took {0}", sw.ElapsedMilliseconds).Dump();
}
UPDATE:
@IanMercer's comment, together with @erisco's answer, pointed me in the right direction. After tweaking the objects a bit I get the expected behavior. Basically, I wrapped the extra data in a nested object, so the small, medium, and big objects all have roughly the same top-level size and their hot field fits the CPU caches equally well. Now the test shows the same times for all three.
public class SmallObject {
    public int JustAnInt0;

    public static SmallObject[] FakeList(int size) {
        var res = new SmallObject[size];
        for (var c = 0; c != size; ++c)
            res[c] = new SmallObject();
        return res;
    }
}
public class MediumObject {
    public int JustAnInt0;

    public class Extra {
        public int JustAnInt1;
        public int JustAnInt2;
        public int JustAnInt3;
        public int JustAnInt4;
    }

    public Extra ExtraData;

    public static MediumObject[] FakeList(int size) {
        var res = new MediumObject[size];
        for (var c = 0; c != size; ++c)
            res[c] = new MediumObject();
        return res;
    }
}
public class BigObject {
    public int JustAnInt0;

    public class Extra {
        public int JustAnInt1;
        public int JustAnInt2;
        public int JustAnInt3;
        public int JustAnInt4;
        public int JustAnInt5;
        public int JustAnInt6;
        public int JustAnInt7;
        public int JustAnInt8;
        public int JustAnInt9;
        public int JustAnInt10;
        public int JustAnInt11;
        public int JustAnInt12;
        public int JustAnInt13;
        public int JustAnInt14;
        public int JustAnInt15;
        public int JustAnInt16;
        public int JustAnInt17;
        public int JustAnInt18;
        public int JustAnInt19;
    }

    public Extra ExtraData;

    public static BigObject[] FakeList(int size) {
        var res = new BigObject[size];
        for (var c = 0; c != size; ++c)
            res[c] = new BigObject();
        return res;
    }
}
void Main()
{
    var size = 30000000;
    var small = SmallObject.FakeList(size);
    var medium = MediumObject.FakeList(size);
    var big = BigObject.FakeList(size);

    var times = Enumerable
        .Range(0, 10)
        .Select(r => {
            var sw = System.Diagnostics.Stopwatch.StartNew();
            for (var c = 0; c != size; ++c) {
                small[c].JustAnInt0++;
            }
            var smalltt = sw.ElapsedMilliseconds;

            sw.Restart();
            for (var c = 0; c != size; ++c) {
                big[c].JustAnInt0++;
            }
            var bigtt = sw.ElapsedMilliseconds;

            sw.Restart();
            for (var c = 0; c != size; ++c) {
                medium[c].JustAnInt0++;
            }
            var mediumtt = sw.ElapsedMilliseconds;

            return new {
                smalltt,
                mediumtt,
                bigtt
            };
        })
        .ToArray();

    (new {
        Small = times.Average(t => t.smalltt),
        Medium = times.Average(t => t.mediumtt),
        Big = times.Average(t => t.bigtt)
    }).Dump();
}
Thank you all!
Upvotes: 0
Views: 175
Reputation: 14329
My answer is pure speculation, but hopefully it offers some things to test and rule out.
public static SmallObject[] FakeList(int size) {
    var res = new SmallObject[size];
    for (var c = 0; c != size; ++c)
        res[c] = new SmallObject();
    return res;
}
FakeList allocates many objects one after the other and stores them in an array. The allocator will store all these objects contiguously: in a generational GC, allocation is done by bumping a pointer, with no search for free space (read here).
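As a rough illustration (this is a conceptual sketch, not the CLR's actual allocator), bump allocation amounts to nothing more than advancing a pointer, which is why consecutive allocations end up adjacent in memory:
class BumpAllocator {
    private readonly byte[] _segment = new byte[1024 * 1024]; // pretend GC segment
    private int _next;                                        // allocation pointer

    // "Allocate" size bytes by bumping the pointer; no free-list search happens.
    public int Allocate(int size) {
        int address = _next;
        _next += size;      // the bump
        return address;     // offset of the new "object" in the segment
    }
}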
Let's say the overhead for an object is 16 bytes. From that, let's guess that SmallObject is 20 bytes, MediumObject is 36 bytes, and BigObject is 96 bytes.
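If you want to sanity-check those guesses in LINQPad, one rough heuristic (an approximation only, and EstimateObjectSize is a helper invented for this test) is to watch GC heap growth per allocation:
static long EstimateObjectSize(Func<object> factory, int count) {
    var keep = new object[count]; // allocated up front so it doesn't skew the delta
    long before = GC.GetTotalMemory(forceFullCollection: true);
    for (int i = 0; i < count; i++)
        keep[i] = factory();
    long after = GC.GetTotalMemory(forceFullCollection: true);
    GC.KeepAlive(keep);
    return (after - before) / count;  // includes per-object header overhead
}

// EstimateObjectSize(() => new SmallObject(), 100000).Dump();
// EstimateObjectSize(() => new BigObject(), 100000).Dump();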
So, we have three arrays of objects stored contiguously. When the CPU fetches an int (4 bytes), it also fetches a chunk of memory adjacent to it (read up on CPU caches and cache lines). Let's say the CPU fetches 64 bytes at a time.
How many objects fit into a cache line?
0             20            40            60            80
| SmallObject | SmallObject | SmallObject | SmallObject |

0              36             72
| MediumObject | MediumObject |

0           96
| BigObject |
Note: we're not considering data alignment here.
A cache line fits 3.2 SmallObjects, 1.77 MediumObjects, and 0.66 BigObjects.
We are incrementing JustAnInt0 in the loop, which happens to be the first declared field of each object. The compiler probably lays out the fields in the same order you declared them (because they are all ints; otherwise it might reorder them for memory alignment purposes).
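For what it's worth, classes default to automatic layout in the CLR, so the runtime is free to reorder fields; if you want to rule reordering out while testing, you can request declaration order explicitly (a sketch using the real StructLayout attribute on a hypothetical variant of the class):
using System.Runtime.InteropServices;

[StructLayout(LayoutKind.Sequential)]  // keep fields in declaration order
public class SequentialSmallObject {
    public int JustAnInt0;  // guaranteed to stay the first field after the header
}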
Taking this into account, let's say JustAnInt0 occupies bytes 16 to 20 of SmallObject, MediumObject, and BigObject alike. This means a single fetch brings in 3 JustAnInt0s from SmallObjects, 2 from MediumObjects, but only 1 from a BigObject.
Therefore, the reason you can increment JustAnInt0 fastest on the SmallObject array is that the CPU can load three JustAnInt0s into its cache at once, so only about a third as many main memory accesses are required compared to BigObject. Main memory access is one to two orders of magnitude slower than CPU cache access (read here). Main memory accesses are among the slowest operations your CPU performs and may dominate the total time cost of an algorithm.
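If you want to see that cost in isolation, here is a small LINQPad-style experiment (it assumes 64-byte cache lines, which is typical but not universal). The stride-1 loop does sixteen times as many increments, yet both loops fetch every cache line of the array once, so their times are often surprisingly close:
var data = new int[64000000];
var sw = System.Diagnostics.Stopwatch.StartNew();

for (var i = 0; i < data.Length; i += 1)   // 16 ints incremented per fetched line
    data[i]++;
string.Format("Stride 1 took {0} ms", sw.ElapsedMilliseconds).Dump();

sw.Restart();
for (var i = 0; i < data.Length; i += 16)  // 1 int incremented per fetched line
    data[i]++;
string.Format("Stride 16 took {0} ms", sw.ElapsedMilliseconds).Dump();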
Again, this is all complete speculation. The only real way to know is to understand your hardware and do some testing. Hopefully this offers a reference point to begin that investigation.
Upvotes: 2
Reputation:
Like the other answer says, it's because of CPU caching and other optimizations:
Smaller arrays: level 1 cache (very fast)
Larger arrays: level 2 cache (fast)
Huge arrays: not cached (normal)
Gigantic arrays: paged to disk (slow)
See this simple explanation.
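One way to make those tiers visible in LINQPad is to do a fixed number of accesses over working sets of increasing size; the time jumps roughly where each cache level overflows. The sizes below are guesses for a typical machine, not measured values:
const int accesses = 100000000;
foreach (var kb in new[] { 16, 256, 4096, 65536 }) {   // ~L1, ~L2, ~L3, RAM
    var data = new int[kb * 1024 / sizeof(int)];       // power-of-two length
    var mask = data.Length - 1;
    var sw = System.Diagnostics.Stopwatch.StartNew();
    for (int i = 0; i < accesses; i++)
        data[(i * 16) & mask]++;                       // jump a cache line each time
    string.Format("{0} KB: {1} ms", kb, sw.ElapsedMilliseconds).Dump();
}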
Upvotes: 0
Reputation: 726809
Shouldn't it be using some kind of pointer arithmetic?
While the CLR does indeed use "some kind of pointer arithmetic" to locate each item in the array, what happens next is different: once you start accessing the JustAnInt0s, the CLR starts reading data through those pointers.
This is where it gets messy: modern hardware is heavily optimized for caching, so when you request JustAnInt0, the hardware predicts that JustAnInt1, JustAnInt2, and so on are going to follow, because for most real-life programs they do. This is called locality of reference. The number of items loaded along with JustAnInt0 depends on the size of the cache line in your hardware. When the objects are small and the cache line is large, an object or two from adjacent memory may be loaded as well.
It appears that when the objects are small, your program inadvertently takes advantage of locality of reference, because multiple small objects end up in the cache when you access small[c].
This behavior also relies on the small objects being allocated next to each other. If you apply a random shuffle to small, medium, and big, the access times should become a lot closer.
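A quick way to test that in the Main above is a Fisher-Yates shuffle of the references before timing; once allocation order no longer matches iteration order, most accesses become cache misses:
var rnd = new Random(12345);
for (int i = small.Length - 1; i > 0; i--) {  // Fisher-Yates shuffle
    int j = rnd.Next(i + 1);
    var tmp = small[i];
    small[i] = small[j];
    small[j] = tmp;
}
// Repeat for medium and big, then re-run the timing loops.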
Upvotes: 5