Reputation: 79
Given two long vectors of 2000 elements each, to be added on a machine with a 32-byte cache line (single-level cache) and a CPU, we have to add the two vectors so that the sum goes into a new vector.
e.g. c[0]=a[0]+b[0], c[1]=a[1]+b[1], c[2]=a[2]+b[2]......... c[1999]=a[1999]+b[1999]
I know that when c[0]=a[0]+b[0]
is done we will have a[0] to a[31], b[0] to b[31] and c[0] to c[31]
in cache (assuming 1-byte elements; with 4-byte ints a 32-byte line would hold only 8 elements per array). So we will get a cache miss at every 32nd element. Somebody asked me this:
Can you optimize it further for better performance, beyond what I stated above (a cache miss only at every 32nd element, thanks to locality)?
I am sure there is something more to this that I don't know.
Upvotes: 1
Views: 290
Reputation: 213029
Assuming a modern superscalar CPU with out-of-order execution, you can use a technique called software pipelining to help mitigate the cost of cache misses. E.g.
for (i = 0; i < N; ++i)
{
    c[i] = a[i] + b[i];
}
becomes:
ai = a[0];          // prologue: prime the pipeline with the first
bi = b[0];          // two pairs of loads and the first add
ci = ai + bi;
ai = a[1];
bi = b[1];
for (i = 0; i < N - 2; ++i)
{
    c[i] = ci;      // note that within this loop the order of operations has
    ci = ai + bi;   // been reversed - instead of load-add-store we now have
    ai = a[i + 2];  // store-add-load - this reduces serial dependencies
    bi = b[i + 2];
}
c[i] = ci;          // epilogue: drain the pipeline - store the two
ci = ai + bi;       // results that are still in flight
c[i + 1] = ci;
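To convince yourself that the transformation is correct you can check it against the naive loop. Here is a minimal, self-contained test sketch (the element type, the test data and the use of assert are my own assumptions, not part of the answer itself):

#include <assert.h>
#include <stddef.h>

#define N 2000

int a[N], b[N], c_naive[N], c_pipe[N];

int main(void)
{
    size_t i;
    int ai, bi, ci;

    // arbitrary test data
    for (i = 0; i < N; ++i)
    {
        a[i] = (int)i;
        b[i] = (int)(2 * i);
    }

    // naive version
    for (i = 0; i < N; ++i)
        c_naive[i] = a[i] + b[i];

    // software-pipelined version from above (requires N >= 3)
    ai = a[0];
    bi = b[0];
    ci = ai + bi;
    ai = a[1];
    bi = b[1];
    for (i = 0; i < N - 2; ++i)
    {
        c_pipe[i] = ci;
        ci = ai + bi;
        ai = a[i + 2];
        bi = b[i + 2];
    }
    c_pipe[i] = ci;
    ci = ai + bi;
    c_pipe[i + 1] = ci;

    // both versions must produce identical results
    for (i = 0; i < N; ++i)
        assert(c_naive[i] == c_pipe[i]);
    return 0;
}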
Typically a full cache miss costs hundreds of cycles (DRAM latency), so in this simple case overlapping the loads/stores with arithmetic will make only a very small difference, but for more complex examples software pipelining can sometimes be useful.
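If the misses themselves are what dominate, another technique worth knowing about is explicit software prefetch. A sketch using the GCC/Clang __builtin_prefetch extension (the function name and the prefetch distance of 16 elements are my own assumptions and would need tuning on a real machine):

#include <stddef.h>

#define PREFETCH_DIST 16  // a guess - tune for the target machine

void add_prefetch(int *c, const int *a, const int *b, size_t n)
{
    for (size_t i = 0; i < n; ++i)
    {
        if (i + PREFETCH_DIST < n)
        {
            // hint the CPU to start fetching these lines now,
            // so they are (hopefully) in cache when we reach them
            __builtin_prefetch(&a[i + PREFETCH_DIST]);
            __builtin_prefetch(&b[i + PREFETCH_DIST]);
        }
        c[i] = a[i] + b[i];
    }
}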
Having said that, most modern CPUs now have automatic (hardware) prefetch, so software pipelining has become less useful than it used to be. Also, many of these explicit optimisations are now handled automatically by good compilers.
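For example, a good compiler will usually vectorise and unroll the plain loop by itself if you give it aliasing information. A sketch using the C99 restrict qualifier (the function name and element type are my assumptions):

#include <stddef.h>

// restrict tells the compiler that c never aliases a or b, so at -O2/-O3
// it is free to vectorise and unroll this loop (e.g. with SSE/AVX on x86).
void add_vectors(int * restrict c, const int * restrict a,
                 const int * restrict b, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}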
Upvotes: 1