Reputation: 16168
template<size_t size>
class Objects{
std::array<int,size> a;
std::array<int,size> b;
std::array<int,size> c;
void update(){
for (size_t i = 0; i < size; ++i){
c[i] = a[i] + b[i];
}
}
};
I am gathering information of how to write cache friendly code since a week and I read though several articles but I still haven't understood the basics.
Code like I have written above is used in most of the examples, but for me this is not cache friendly at all.
For me the memory layout should look like this
aaaabbbbcccc
and in the first loop it will access
[a]aaa[b]bbb[c]ccc
If I understood it correctly, the cpu prefetches elements that are near in memory. I am not sure how intelligent this method is but I assume it's primitive and it just fetches the nth nearest elements.
The problem is that [a]aaa[b]bbb[c]ccc
will not access the elements in order at all. So it might fetch the next '3' elements a[aaa]bbbbcccc
which is nice for the next a because it will be a cache hit but not for the b.
Is the example above cache friendly code?
Upvotes: 0
Views: 2465
Reputation: 1088
Your code is not particularly unfriendly. It requires three active cache lines at a time instead of one, but that isn't too much to ask. Your code would be a lot more cache-unfriendly if instead of
std::array<int,size> a;
you had
std::array<struct { int x; char description[5000]; }, size> a;
because then the CPU would have to pick out the lone x
among the thousands of bytes of description
(which your loop never uses).
Your example would also be more cache-unfriendly if you had not just a
, b
, and c
, but also d
-z
and aa
-az
and maybe a few more. (How far you have to go depends on the sophistication of your cache - how many way-associative it is, etc.)
Have you profiled yours vs Thomas Matthews' code?
Upvotes: 1
Reputation: 1
You should trust the compiler optimization work (and of course enable optimizations); it probably deals quite well with the CPU cache (perhaps by issuing appropriate prefetch instructions).
Sometimes you can hint the compiler thru builtins or pragmas. For example with GCC on x86-64 you might -with care- use the __builtin_prefetch. Usually it is not worth the effort (and if you misuse it performance will suffer). See this answer to a related question.
Upvotes: 0
Reputation: 57749
I suggest you use an array of structures:
struct Cache_Item
{
int a;
int b;
int c;
};
Cache_Item cache_line[size];
for (unsigned int i = 0; i < size; ++i)
{
cache_line[i].c = cache_line[i].a + cache_line[i].b;
}
The structure arrangement allows for all the variables in use to be next to each other in the cache line or very close.
In your array method, element b[0] ideally is at location a[size], so they are size
items apart. This could mean that they are on different cache lines. The result location, c[0], is at a[size + size], which means it could be 2 cache lines away.
Upvotes: 2