Reputation: 905
I understand that spatial and temporal locality have an enormous impact on performance. What I don't understand is how my data structures are stored in these caches. For simplicity, assume the L1 cache has 8 bytes, the L2 16 bytes and the L3 32 bytes. Does that mean that if we have:
std::array<double, 1> x = {1.};
std::array<double, 2> y = {1.,2.};
std::array<double, 4> z = {1.,2.,3.,4.};
and some function accesses these arrays, will x be loaded into the L1 cache, y into L2 and z into L3? Or will y, for example, be split across the L1 and L2 caches?
Will splitting these arrays manually yield better cache locality? If I do, for example, something like:
std::array<std::array<double,2>,2> z;
will z be split across the cache levels when a function accesses it?
What about cache lines? These are usually 64 bytes long. Will splitting my arrays into arrays of 64-byte arrays yield better access speed?
std::array<std::array<double,8>,2> u;
I find this subject quite confusing and would appreciate any help.
Upvotes: 4
Views: 899
Reputation: 11968
You are thinking about the caches the wrong way.
You can only see which cache holds them with special tools (the Intel debugger comes to mind), and the results will be specific to your run and architecture. Changing the processor can break your setup fairly easily.
That said, you can try to use solutions that are cache friendly.
The way the caches work is this: say you want to read x[0]. Your program makes a request for the memory location associated with it. That request is intercepted by L1. If L1 can provide the value (because it is in a block it already stores), it will. If not, the request is intercepted by L2, and so on. If no cache level has that block, it is requested from RAM.
Now, it's inefficient to read just 8 bytes from RAM because of the overhead per request. So in practice an L3-sized block is read from RAM, one that includes the bytes you want. It can happen that you have to read 2 blocks because your data is split between them (compilers try to avoid this). Then an L2-block-sized chunk is sent to the L2 cache, and an L1-sized chunk to L1, all of them containing the bytes you want (which might sit in the middle somewhere). For the next request (say x[1]) the same thing happens. If the next access is close to the last one, you will likely get the result from L1. I say likely because your program might have been suspended and resumed on a different core or processor, which has a different L1.
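A minimal sketch that makes this visible, assuming nothing about your machine beyond "sequential is friendlier than strided": both loops below sum the same buffer, but the strided one uses only one double out of each cache line it pulls in before moving on (the buffer size and stride are arbitrary picks for illustration):
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    constexpr std::size_t n = 1 << 24;   // 16M doubles (~128 MiB), larger than any L3
    constexpr std::size_t stride = 4096; // jumps across many cache lines
    std::vector<double> data(n, 1.0);

    // sequential: every byte of each loaded cache line gets used
    auto t0 = std::chrono::steady_clock::now();
    double s1 = 0;
    for (std::size_t i = 0; i < n; ++i) s1 += data[i];
    auto t1 = std::chrono::steady_clock::now();

    // strided: one double per cache line, the line is evicted before reuse
    double s2 = 0;
    for (std::size_t start = 0; start < stride; ++start)
        for (std::size_t i = start; i < n; i += stride) s2 += data[i];
    auto t2 = std::chrono::steady_clock::now();

    using ms = std::chrono::milliseconds;
    std::printf("sequential %lld ms, strided %lld ms (sums %.0f %.0f)\n",
                (long long)std::chrono::duration_cast<ms>(t1 - t0).count(),
                (long long)std::chrono::duration_cast<ms>(t2 - t1).count(), s1, s2);
}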
Trying to design for a specific setup is usually a bad idea (unless you really need those last few percent of performance and you have already tried everything else).
The rule of thumb is to keep accessing memory that's next to each other. The thing to avoid is accessing a few bytes that are far apart. Going through an array is very fast. Try implementing a linear search and a binary search over the same sorted array and see how long the array needs to be before you get significantly better performance out of the binary search (last time I tried, the array had to grow to somewhere above 100 ints).
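A rough sketch of that experiment (the sizes and lookup count are arbitrary choices; the crossover point is machine-specific, so treat the numbers you get as your own):
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

int main() {
    std::mt19937 rng(42);
    for (std::size_t n : {16, 64, 256, 1024, 4096}) {
        std::vector<int> a(n);
        std::iota(a.begin(), a.end(), 0);                 // sorted keys 0..n-1
        std::uniform_int_distribution<int> pick(0, (int)n - 1);
        constexpr int lookups = 200000;
        long long hits = 0;

        auto t0 = std::chrono::steady_clock::now();
        for (int k = 0; k < lookups; ++k)                 // linear scan: contiguous reads
            hits += std::find(a.begin(), a.end(), pick(rng)) != a.end();
        auto t1 = std::chrono::steady_clock::now();
        for (int k = 0; k < lookups; ++k)                 // binary search: scattered reads
            hits += std::binary_search(a.begin(), a.end(), pick(rng));
        auto t2 = std::chrono::steady_clock::now();

        using us = std::chrono::microseconds;
        std::printf("n=%4zu  linear %6lld us  binary %6lld us  (hits %lld)\n", n,
                    (long long)std::chrono::duration_cast<us>(t1 - t0).count(),
                    (long long)std::chrono::duration_cast<us>(t2 - t1).count(), hits);
    }
}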
In your example, if you first access all elements of x, then move on to y, and so on, the setup is good. If instead you access x[i], y[i], z[i], then x[i+1], y[i+1], z[i+1], then having a struct with {x,y,z} and storing those structs in an array might be better (you need to benchmark to know for sure); see the sketch below.
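A sketch of the two layouts (the Point struct and the field names are made up for illustration; which layout wins depends entirely on your access pattern, so measure):
#include <array>
#include <cstddef>
#include <cstdio>

struct Point { double x, y, z; };        // hypothetical struct bundling the trio

constexpr std::size_t N = 1024;
std::array<double, N> xs{}, ys{}, zs{};  // layout A: three separate arrays
std::array<Point, N> pts{};              // layout B: one array of structs

int main() {
    double a = 0, b = 0;
    // layout A: each iteration touches three cache lines that are far apart
    for (std::size_t i = 0; i < N; ++i) a += xs[i] + ys[i] + zs[i];
    // layout B: the x, y, z of one element sit in the same or adjacent line
    for (std::size_t i = 0; i < N; ++i) b += pts[i].x + pts[i].y + pts[i].z;
    std::printf("%f %f\n", a, b);
}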
and some function accesses these arrays, will x be loaded into the L1 cache, y into L2 and z into L3? Or will y, for example, be split across the L1 and L2 caches?
They will all be loaded into L1, L2 and L3 as you access them. If you access them often enough, you will get them from a lower-level (faster) cache.
Will splitting these arrays manually yield better cache locality?
No. The processor's memory management handles the splits. Cache locality depends on how often you access a particular part of memory; it's better to have all the accesses to it bunched up in time rather than spread out.
What about cache lines? These are usually 64 bytes long. Will splitting my arrays into arrays of 64-byte arrays yield better access speed?
No, you likely won't see any difference. The arrays are split across cache lines automatically by the processor's memory hardware. And again, don't over-optimize for your current processor architecture; the CPU you buy tomorrow might have cache lines twice as long out of the box. If you ever do need the line size in code, prefer asking the implementation over hard-coding 64; see the sketch below.
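For example, C++17 exposes a compile-time hint for the cache-line size (availability varies by compiler, hence the feature-test guard):
#include <cstdio>
#include <new>

int main() {
#ifdef __cpp_lib_hardware_interference_size
    // compile-time hint for the cache-line size of the target platform
    std::printf("cache line hint: %zu bytes\n",
                std::hardware_destructive_interference_size);
#else
    std::printf("this implementation provides no cache-line size hint\n");
#endif
}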
Upvotes: 4