Reputation: 1220
I am using the following code to test the effect of flushing the cache after initialization on the runtime of a daxpy routine (the full code, with the fill() and wall_time() routines, is here: http://codepad.org/QuLT3cbD - it is 150 lines):
#include <cstdio>
#include <cmath>
#include <boost/accumulators/accumulators.hpp>
#include <boost/accumulators/statistics/stats.hpp>
#include <boost/accumulators/statistics/mean.hpp>
#include <boost/accumulators/statistics/variance.hpp>
#include <boost/accumulators/statistics/count.hpp>
#include <boost/math/distributions/students_t.hpp>

using namespace boost::accumulators;
using boost::math::students_t;

// fill(), wall_time(), daxpy() and flush_array() are defined in the full
// code linked above.

#define KB 1024

int main()
{
    int cache_size = 32*KB;
    double alpha = 42.5;
    int operand_size = cache_size/(sizeof(double)*2);
    double* X = new double[operand_size];
    double* Y = new double[operand_size];

    //95% confidence interval
    double max_risk = 0.05;
    //Interval half width
    double w;
    int n_iterations = 100;
    students_t dist(n_iterations-1);
    double T = boost::math::quantile(complement(dist,max_risk/2));

    accumulator_set<double, stats<tag::mean,tag::variance> > unflushed_acc;
    for(int i = 0; i < n_iterations; ++i)
    {
        fill(X,operand_size);
        fill(Y,operand_size);
        double seconds = wall_time();
        daxpy(alpha,X,Y,operand_size);
        seconds = wall_time() - seconds;
        unflushed_acc(seconds);
    }
    w = T*sqrt(variance(unflushed_acc))/sqrt(count(unflushed_acc));
    printf("Without flush: time=%g +/- %g ns\n",mean(unflushed_acc)*1e9,w*1e9);

    //Using clflush instruction
    //We need to put the operands back in cache
    accumulator_set<double, stats<tag::mean,tag::variance> > clflush_acc;
    for(int i = 0; i < n_iterations; ++i)
    {
        fill(X,operand_size);
        fill(Y,operand_size);
        flush_array(X,operand_size);
        flush_array(Y,operand_size);
        double seconds = wall_time();
        daxpy(alpha,X,Y,operand_size);
        seconds = wall_time() - seconds;
        clflush_acc(seconds);
    }
    w = T*sqrt(variance(clflush_acc))/sqrt(count(clflush_acc));
    printf("With clflush: time=%g +/- %g ns\n",mean(clflush_acc)*1e9,w*1e9);

    delete[] X;
    delete[] Y;
    return 0;
}
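flush_array() is not shown above; a minimal sketch of such a helper (assuming x86, 64-byte cache lines, and the _mm_clflush/_mm_mfence intrinsics from <emmintrin.h>; the exact routine is in the full code linked above and may differ) looks like this:

#include <cstdint>
#include <emmintrin.h>

// Sketch of a flush_array-style helper: evict every cache line spanned by
// the operand from the whole cache hierarchy, then fence so the flushes
// complete before the timed daxpy starts.
void flush_array(const double* p, int n)
{
    const std::uintptr_t line = 64;  // assumed cache line size
    std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(p) & ~(line - 1);
    std::uintptr_t end  = reinterpret_cast<std::uintptr_t>(p + n);
    for (; addr < end; addr += line)
        _mm_clflush(reinterpret_cast<const void*>(addr));
    _mm_mfence();
}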
When I run this code, it reports these mean times and their uncertainties (at the 95% confidence level):

Without flush: time=3103.75 +/- 0.524506 ns
With clflush:  time=4651.72 +/- 201.25 ns
Why does flushing the operands X and Y from cache with clflush increase the noise in the measurements by over 100x?
Upvotes: 3
Views: 273
Reputation: 11582
At 3 GHz, 0.52 ns is about 1.5 CPU cycles and 201.25 ns is about 604 CPU cycles. Considering that it takes a couple of hundred CPU cycles or more to read a cache line from DRAM when it misses in the cache hierarchy, you are measuring the variance caused by one or two cache line misses, which isn't much. In the unflushed case you have a very tight reading of the mean time, with a very tight variance: no cache misses are happening.
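(Just to show where those cycle counts come from - a minimal sketch, assuming the 3 GHz clock above and using the +/- values from the question's output:)

#include <cstdio>

int main()
{
    const double clock_ghz = 3.0;  // assumed core clock; cycles = ns * GHz
    printf("unflushed jitter: %.1f cycles\n", 0.524506 * clock_ghz);  // ~1.6 cycles
    printf("flushed jitter:   %.1f cycles\n", 201.25   * clock_ghz);  // ~604 cycles
    return 0;
}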
I found that by increasing the number of iterations to around 5000 on my Mac (you have it at 100 iterations) I could get as tight a reading on the variance for the flushed case as for the unflushed case. It stands to reason that the mean time when the data is not in the cache is larger than when it is all in the cache, but it isn't as slow as you might expect. That is because the CPU is very effective at predicting your access pattern and (speculatively) prefetches data ahead of your anticipated use - many cache lines ahead, in fact, to hide the relatively long latency of going to DRAM.
The CPU may issue several data cache (and instruction cache) prefetches ahead of the anticipated use; by having many memory reads in flight concurrently it effectively reduces the memory latency (provided it guesses correctly, of course). This also introduces non-determinism.

In your unflushed case practically all of the data is in the Level 1 data cache. The stack memory references are in addition to your 32 KB of data, so that overflows the L1 data cache, but not by much, and it will quickly be refilled from the Level 2 cache - the point is that there is no need to go to the memory controller/DRAM. In the flushed case your data is only in memory, so you get variability based on how the processor's prefetching (data and instruction) contends for the memory controller, and on whatever is going on in other cores that share the same memory controller. By running the test longer you let the system settle into a "normal" mode for that access pattern and the variance goes down.
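(For reference, the half-width the question prints is the standard Student-t interval T*s/sqrt(n), so going from 100 to 5000 iterations already tightens it by about a factor of 7 even before the sample variance itself settles. A minimal sketch of that formula, factored out of the question's code into a hypothetical half_width() helper using the same Boost types:)

#include <cmath>
#include <boost/accumulators/accumulators.hpp>
#include <boost/accumulators/statistics/stats.hpp>
#include <boost/accumulators/statistics/mean.hpp>
#include <boost/accumulators/statistics/variance.hpp>
#include <boost/accumulators/statistics/count.hpp>
#include <boost/math/distributions/students_t.hpp>

using namespace boost::accumulators;

// Confidence half-width T*s/sqrt(n) for a set of accumulated timings.
// With n = 5000 instead of 100, the sqrt(n) in the denominator alone
// shrinks the interval by roughly 7x, before any reduction in the
// sample variance itself.
double half_width(const accumulator_set<double, stats<tag::mean, tag::variance> >& acc,
                  double max_risk = 0.05)
{
    boost::math::students_t dist(static_cast<double>(count(acc)) - 1);
    double T = boost::math::quantile(complement(dist, max_risk / 2));
    return T * std::sqrt(variance(acc)) / std::sqrt(static_cast<double>(count(acc)));
}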
Upvotes: 5