Sam Manzer

Reputation: 1220

Inconsistent Timings Between Cache Flushing Routines

I need to run timings of a DAXPY linear algebra kernel. Naively, I would want to try something like this:

fill(X,operand_size);
fill(Y,operand_size);
double seconds = timer();
daxpy(alpha,X,Y,operand_size);
seconds = timer() - seconds;

The full code link is at the end if desired.

The problem is that the memory accesses made while filling the operands X and Y leave them in the processor cache. The subsequent accesses in the DAXPY call are therefore much faster than they would be in a production run.

I compare two ways of getting around this. The first way is to flush the operands from all levels of cache via the clflush instruction. The second method is to read a very large array until the operand entries are "naturally" evicted from the cache. I test both, and this is the runtime for a single DAXPY call with operands of size 2048:

Without flush: time=2840 ns
With clflush: time=4090 ns
With copy flush: time=5919 ns

Here is another run made a few seconds later:

Without flush: time=2717 ns
With clflush: time=4121 ns
With copy flush: time=4796 ns

As expected, the flush increases the runtime. However, I do not understand how the copy flush could result in a drastically longer runtime for the DAXPY routine than clflush does. The clflush instruction should evict the operands from ALL levels of cache, so the time after a clflush should be an upper bound on the execution time after any other cache flushing procedure. On top of that, the flushed timings (for both methods) bounce around a lot (by hundreds of nanoseconds, vs. less than 10 nanoseconds for the unflushed case). Does anyone know why the manual flush would make such a drastic difference in runtime?
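For reference, the two flush strategies could be sketched like this. The function names are mine, not from the linked code; the clflush path is x86-only and assumes 64-byte cache lines:

```c
#include <stddef.h>

/* Method 1: evict each 64-byte line of the operand with CLFLUSH.
   Hypothetical sketch; x86 only. */
static void flush_clflush(const double *p, size_t n)
{
#if defined(__x86_64__) || defined(__i386__)
    for (size_t i = 0; i < n; i += 8)   /* 8 doubles per 64-byte line */
        __asm__ volatile("clflush (%0)" : : "r"(p + i) : "memory");
    __asm__ volatile("mfence" : : : "memory");  /* order the flushes */
#else
    (void)p;
    (void)n;                            /* no-op off x86 */
#endif
}

/* Method 2: copy between two buffers much larger than the last-level
   cache, so the operands are evicted "naturally". Both the reads of
   src and the writes to dst occupy cache lines afterward. */
static void flush_by_copy(double *dst, const double *src, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[i];
}
```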

Appendix

The full code, with all timing routines and flush routines, is here (194 lines):

http://codepad.org/hNJpQxTv

This is my gcc version. The code is compiled with the -O3 option. (I know, it's old; some software I have to build is incompatible with newer gcc)

Using built-in specs.

Target: i686-apple-darwin10
Configured with: /var/tmp/gcc/gcc-5646.1~2/src/configure --disable-checking --enable-werror --prefix=/usr --mandir=/share/man --enable-languages=c,objc,c++,obj-c++ --program-transform-name=/^[cg][^.-]*$/s/$/-4.2/ --with-slibdir=/usr/lib --build=i686-apple-darwin10 --with-gxx-include-dir=/include/c++/4.2.1 --program-prefix=i686-apple-darwin10- --host=x86_64-apple-darwin10 --target=i686-apple-darwin10
Thread model: posix
gcc version 4.2.1 (Apple Inc. build 5646) (dot 1)

I am on Mac OS X 10.6 with an Intel Xeon 5650 processor.

Upvotes: 1

Views: 263

Answers (1)

Brendan

Reputation: 37214

The pseudo-code for (part of) what the CPU actually does for a normal read (which would occur a lot while you're benchmarking) might be:

if( cache line is not in cache ) {
    if(cache is full) {
        find cache line to evict
        if( cache line to evict is in "modified" state ) {
            write cache line to evict to slower RAM
        }
        set cache line to "free"
    }
    fetch the cache line from slower RAM 
}
temp = extract whatever we're reading from the cache line

If you used CLFLUSH to flush the cache, then the if(cache is full) branch will be false, because CLFLUSH leaves the cache empty.

If you used copying to flush the cache, then the if(cache is full) branch will be true, and the if( cache line to evict is in "modified" state ) test will also be true about half of the time (half the cache will contain data you read during copying and the other half will contain data you wrote during copying). This means that about half the time you end up doing the write cache line to evict to slower RAM step.

Doing the write cache line to evict to slower RAM consumes RAM chip bandwidth and affects the performance of fetch the cache line from slower RAM.
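If this explanation is right, it suggests an experiment: flush with a sweep that only reads, so every evicted line is clean and no writebacks compete with the DAXPY fetches. DAXPY timings after such a sweep should then land closer to the CLFLUSH numbers than the copy flush does. A hypothetical sketch (not from the linked code, assuming 64-byte lines and a buffer much larger than the last-level cache):

```c
#include <stddef.h>

/* Read-only cache sweep: touches one double per 64-byte line.
   Evicted lines are never dirtied, so no writebacks are triggered.
   Returning the sum keeps the loads from being optimized away. */
static double flush_read_only(const double *buf, size_t n)
{
    double sink = 0.0;
    for (size_t i = 0; i < n; i += 8)
        sink += buf[i];
    return sink;
}
```

The caller should use (or print) the returned sink value, otherwise an optimizing compiler may delete the whole loop.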

Upvotes: 2
