samkit
samkit

Reputation: 79

Cache Optimizations for adding 2 long vector

Given 2 long vectors 2000 element each are to be added on machine with 32 byte cache line (single level cache) and a CPU. We have to add these 2 vectors such that sum goes in a new vector.
e.g. c[0]=a[0]+b[0], c[1]=a[1]+b[1], c[2]=a[2]+b[2]......... c[1999]=a[1999]+b[1999]

I know when c[0]=a[0]+b[0] is done we will have a[0]to a[31], b[0]to b[31], c[0]to c[31] in cache . So we will get a cache miss at every 32nd element. Somebody asked me this:

Can you optimize it more for getting better performance (over what I stated above. Cache miss only at 32 element because of locality)?

I am sure there is something more to this that I don't know.

Upvotes: 1

Views: 290

Answers (1)

Paul R
Paul R

Reputation: 213029

Assuming a modern superscalar CPU with out-of-order execution, you can use a technique called software pipelining to help mitigate the cost of cache misses. E.g.

for (i = 0; i < N; ++i)
{
    c[i] = a[i] + b[i];
}

becomes:

ai = a[0];
bi = b[0];
ci = ai + bi;
ai = a[1];
bi = b[1];
for (i = 0; i < N - 2; ++i)
{
    c[i] = ci;           // note that within this loop the order of operations has
    ci = ai + bi;        // been reversed - instead of load-add-store we now have
    ai = a[i + 2];       // store-add-load - this reduces serial dependencies
    bi = b[i + 2];
}
c[i] = ci;
ci = ai + bi;
c[i + 1] = ci;

Typically a total cache miss costs 100s of cycles (DRAM latency), so in this simple case the overlapping of loads/stores and arithmetic will make only a very small difference, but for complex examples software pipelining can sometimes be useful.

Having said that, most modern CPUs now have automatic (hardware) prefetch, so software pipelining has become less useful than it used to be. Also many of these explicit optimisations are now handled automatically by good compilers.

Upvotes: 1

Related Questions