user3639557

Reputation: 5271

Flushing the cache to prevent benchmarking fluctuations

I am running someone else's C++ code to benchmark it on a dataset. The issue is that I often get one timing for the first run, and the numbers change massively (e.g. from 28 seconds to 10 seconds) when I run the same code again. I assume this happens because of the CPU's automatic caching. Is there a way to flush the cache, or otherwise prevent these fluctuations?

Upvotes: 12

Views: 10878

Answers (2)

BeeOnRope

Reputation: 64875

Here's my basic approach:

  1. Allocate a memory region 2x the size of the LLC, if you can determine the LLC size dynamically (or you know it statically), or if you don't, some reasonable multiple of the largest LLC size on the platform of interest¹.
  2. memset the memory region to some non-zero value: 1 will do just fine.
  3. "Sink" the pointer somewhere so that the compiler can't optimize out the stuff above or below (writing to a volatile global works pretty much 100% of the time).
  4. Read from random indexes in the region until you've touched each cache line an average of 10 times or so (accumulate the read values into a sum that you sink in a similar way to (3)).
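
The four steps above can be sketched roughly as follows. This is only a minimal illustration, not a tuned implementation: the function name `flush_cache_by_reading`, the sink globals, the fixed RNG seed, and the 64-byte line size are all assumptions made for the example, and the caller has to supply the LLC size.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <random>
#include <vector>

// Volatile globals used as "sinks" (step 3 and the end of step 4).
void* volatile g_sink_ptr = nullptr;
volatile std::uint64_t g_sink_sum = 0;

// Hypothetical helper. llc_bytes is the caller's estimate of the LLC size.
// Returns the accumulated sum of the bytes that were read.
std::uint64_t flush_cache_by_reading(std::size_t llc_bytes,
                                     std::size_t touches_per_line = 10) {
    constexpr std::size_t kLineBytes = 64;  // assumed cache-line size
    assert(llc_bytes >= kLineBytes);

    // Step 1: a region twice the (assumed) LLC size.
    std::vector<unsigned char> region(2 * llc_bytes);

    // Step 2: fill with a non-zero value. std::vector already zero-fills,
    // so this explicit memset mainly mirrors the recipe.
    std::memset(region.data(), 1, region.size());

    // Step 3: let the pointer escape through a volatile global.
    g_sink_ptr = region.data();

    // Step 4: read randomly chosen cache lines, about touches_per_line
    // reads per line on average, and accumulate what was read.
    const std::size_t lines = region.size() / kLineBytes;
    const std::size_t reads = lines * touches_per_line;
    std::mt19937_64 rng(12345);  // fixed seed, arbitrary choice
    std::uniform_int_distribution<std::size_t> pick_line(0, lines - 1);
    std::uint64_t sum = 0;
    for (std::size_t i = 0; i < reads; ++i) {
        sum += region[pick_line(rng) * kLineBytes];
    }
    g_sink_sum = sum;  // sink the sum so the reads are not dead code
    return sum;
}
```

You would call something like `flush_cache_by_reading(llc_size)` right before each timed run. Allocating inside the function keeps the sketch short; in a real harness you would probably allocate the region once and reuse it.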

Here are some notes on why this generally works and why doing less may not work. The details are x86-centric, but similar concerns will apply on many other architectures.

  • You absolutely want to write to the allocated memory (step 2) before you begin your main read-only flushing loop, since otherwise you might just be repeatedly reading from the same small zero-mapped page returned by the OS to satisfy your memory allocation.
  • You want to use a region considerably larger than the LLC size, since the outer cache levels are typically physically addressed, but you can only allocate and access virtual addresses. If you just allocate an LLC-sized region, you generally won't get full coverage of all the ways of every cache set: some sets will be over-represented (and so will be fully flushed), while other sets will be under-represented, so not all existing values can even be flushed by accessing this region of memory. A 2x over-allocation makes it highly likely that almost all sets have enough representation.
  • You want to avoid the optimizer doing clever things, such as noting the memory never escapes the function and eliminating all your reads and writes.
  • You want to iterate randomly around the memory region, rather than just striding through it linearly: some designs like the LLC on recent Intel detect when a "streaming" pattern is present, and switch from LRU to MRU since LRU is about the worst-possible replacement policy for such a load. The effect is that no matter how many times you stream through memory, some "old" lines from before your efforts can remain in the cache. Randomly accessing memory defeats this behavior.
  • You want to access more than just an LLC-sized amount of memory because (a) the same reasoning applies as for allocating more than the LLC size (virtual access vs physical caching), (b) random access needs more accesses before you have a high likelihood of hitting every set enough times, and (c) caches are usually only pseudo-LRU, so you need more accesses than you'd expect under exact LRU to flush out every line.

Even this is not foolproof. Other hardware optimizations or caching behaviors not considered above could cause this approach to fail. You might get very unlucky with the page allocation provided by the OS and not be able to reach all the pages (you can largely mitigate this by using 2MB pages). I highly recommend testing whether your flush technique is adequate: one approach is to measure the number of cache misses using CPU performance counters while running your benchmark and see if the number makes sense based on the known working-set size².

Note that this leaves all levels of the cache with lines in E (exclusive) or perhaps S (shared) state, and not the M (modified) state. This means that these lines don't need to be evicted to other cache levels when they are replaced by accesses in your benchmark: they can simply be dropped. The approach described in the other answer will leave most/all lines in the M state, so you'll initially have 1 line of eviction traffic for every line you access in your benchmark. You can achieve the same behavior with my recipe above by changing step 4 to write rather than read.
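
As a rough illustration of the write variant mentioned above, here is a sketch where step 4 stores to random cache lines instead of loading from them. The name `flush_cache_by_writing`, the 64-byte line size, and the fixed seed are assumptions for the example. Storing through a volatile pointer is my choice here to keep the compiler from discarding the stores; it is not part of the recipe itself.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <random>
#include <vector>

// Hypothetical helper. llc_bytes is the caller's estimate of the LLC size.
// Returns the number of stores issued.
std::size_t flush_cache_by_writing(std::size_t llc_bytes,
                                   std::size_t touches_per_line = 10) {
    constexpr std::size_t kLineBytes = 64;  // assumed cache-line size
    assert(llc_bytes >= kLineBytes);

    // Steps 1 and 2: over-allocate and fill with a non-zero value.
    std::vector<unsigned char> region(2 * llc_bytes);
    std::memset(region.data(), 1, region.size());

    // Stores through a volatile pointer cannot be optimized away.
    volatile unsigned char* bytes = region.data();

    // Step 4, write flavor: dirty randomly chosen cache lines.
    const std::size_t lines = region.size() / kLineBytes;
    const std::size_t stores = lines * touches_per_line;
    std::mt19937_64 rng(12345);  // fixed seed, arbitrary choice
    std::uniform_int_distribution<std::size_t> pick_line(0, lines - 1);
    for (std::size_t i = 0; i < stores; ++i) {
        bytes[pick_line(rng) * kLineBytes] = static_cast<unsigned char>(i);
    }
    return stores;
}
```

Running the benchmark once after this and once after a read-only flush gives you the two extremes to compare.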

In that regard, neither approach here is inherently "better" than the other: in the real world the cache levels will have a mix of modified and not-modified lines, while these approaches leave the cache at the two extremes of the continuum. In principle you could benchmark with both the all-M and no-M states, and see if it matters much: if it does, you can try to evaluate what the real-world state of the cache will usually be and replicate that.


¹ Remember that LLC sizes are growing almost every CPU generation (mostly because core counts are increasing), so you want to leave some room for growth if this needs to be future-proof.

² I just throw that out there as if it was "easy", but in reality it may be very difficult depending on your exact problem.

Upvotes: 11

Mats Petersson

Reputation: 129314

Not one that works "for everything, everywhere". Most processors have special instructions to flush the cache, but they are often privileged instructions, so it has to be done from inside the OS kernel, not your user-mode code. And of course, the instructions are completely different for each processor architecture.

All current x86 processors do have a clflush instruction, which flushes one cache line, but to use it you need the address of the data (or code) you want to flush. That is fine for small and simple data structures, but not so good if you have a binary tree that is all over the place. And of course, it is not at all portable.
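
For the simple case where you do know the address range, a sketch using the `_mm_clflush` intrinsic might look like the following. The helper name `flush_range` and the 64-byte line size are assumptions for illustration, and on anything other than x86-64 this version only counts lines without flushing them.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__x86_64__) || defined(_M_X64)
#include <emmintrin.h>  // _mm_clflush, _mm_mfence (SSE2)
#define HAVE_CLFLUSH 1
#else
#define HAVE_CLFLUSH 0
#endif

// Hypothetical helper: flushes every cache line that overlaps
// [addr, addr + bytes). Returns the number of lines visited.
std::size_t flush_range(const void* addr, std::size_t bytes) {
    constexpr std::uintptr_t kLineBytes = 64;  // assumed cache-line size
    if (bytes == 0) return 0;

    // Round the start down to a line boundary.
    const std::uintptr_t begin =
        reinterpret_cast<std::uintptr_t>(addr) & ~(kLineBytes - 1);
    const std::uintptr_t end = reinterpret_cast<std::uintptr_t>(addr) + bytes;

    std::size_t lines = 0;
    for (std::uintptr_t p = begin; p < end; p += kLineBytes) {
#if HAVE_CLFLUSH
        _mm_clflush(reinterpret_cast<const void*>(p));
#endif
        ++lines;
    }
#if HAVE_CLFLUSH
    _mm_mfence();  // order the flushes before later loads and stores
#endif
    return lines;
}
```

For a contiguous array this is a single call such as `flush_range(data.data(), data.size() * sizeof(data[0]))`. For a pointer-based structure you would have to walk every node, which is exactly the limitation described above.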

In most environments, reading and writing a large block of alternative data will do the job, e.g. something like:

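#include <cstddef>  // std::size_t
#include <cstdlib>  // std::rand

// Global variables. Note that this is an element count: the buffer holds
// 10 Mi longs (80 MiB where long is 8 bytes), larger than typical caches.
const std::size_t bigger_than_cachesize = 10 * 1024 * 1024;
long *p = new long[bigger_than_cachesize];
...
// When you want to "flush" the cache: overwrite the whole buffer.
for (std::size_t i = 0; i < bigger_than_cachesize; i++)
{
   p[i] = std::rand();
}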

Using rand will be much slower than filling with something constant/known. But the compiler can't optimise the call away, which means it's (almost) guaranteed that the code will stay.

The above won't flush instruction caches. That is a lot more difficult to do: basically, you have to run some other (large enough) piece of code to do that reliably. However, instruction caches tend to have less effect on overall benchmark performance. That is not to say the instruction cache is unimportant (it is EXTREMELY important for a modern processor's performance), but the code for a benchmark is typically small enough that it all fits in the cache, and the benchmark runs over the same code many times, so only the first iteration is slower.

Other ideas

Another way to simulate "non-cache" behaviour is to allocate a new area for each benchmark pass. In other words, either don't free the memory until the end of the benchmark, or use an array containing the data and output results, such that each run has its own set of data to work on.
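
A minimal sketch of the "own data per pass" idea, assuming the input is a `std::vector<int>`. The helper name `run_on_fresh_copies` is made up for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: calls `pass` once per private copy of `input`.
template <typename Pass>
void run_on_fresh_copies(const std::vector<int>& input, std::size_t passes,
                         Pass&& pass) {
    assert(passes > 0);
    // All copies are made up front and stay alive until the function
    // returns, so no pass reuses a buffer that an earlier pass worked on.
    std::vector<std::vector<int>> copies(passes, input);
    for (std::vector<int>& data : copies) {
        pass(data);  // in a real benchmark, time this call
    }
}
```

Keep in mind that making the copies itself pulls the data through the cache, so this probably only approximates cold behaviour when the combined size of the copies is well above the cache size. Passes run in creation order, so each pass works on the copy that was written longest ago.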

Further, it's common to actually measure the performance of the "hot runs" of a benchmark, not the first "cold run" where the caches are empty. This does of course depend on what you are actually trying to achieve...
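
If the hot runs are what you care about, one possible way to separate them from the cold run is sketched below. The `Timing` struct and `time_cold_and_hot` are hypothetical names, and taking the median of the hot runs is just one reasonable choice.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

struct Timing {
    double cold_seconds;        // first run, caches not yet warmed by fn
    double hot_median_seconds;  // median of the following runs
};

// Hypothetical helper: runs fn once "cold", then hot_runs more times.
template <typename Fn>
Timing time_cold_and_hot(Fn&& fn, std::size_t hot_runs) {
    assert(hot_runs > 0);
    using clock = std::chrono::steady_clock;
    auto time_once = [&fn] {
        const auto start = clock::now();
        fn();
        return std::chrono::duration<double>(clock::now() - start).count();
    };

    Timing result{};
    result.cold_seconds = time_once();

    std::vector<double> hot(hot_runs);
    for (double& seconds : hot) seconds = time_once();

    // nth_element places the median at the middle position.
    const auto mid = hot.begin() + static_cast<std::ptrdiff_t>(hot.size() / 2);
    std::nth_element(hot.begin(), mid, hot.end());
    result.hot_median_seconds = *mid;
    return result;
}
```

Reporting both numbers makes a 28 s versus 10 s gap like the one in the question visible as a cold/hot difference instead of unexplained noise.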

Upvotes: 18
