Guru Prasad

Reputation: 4233

A micro benchmark for memory

I'm trying to write a tiny micro benchmark in C to test memory.

I believe the cache size on my machine (an Intel i5) is 8 MB.

Could someone suggest some logic to test memory while ensuring a cache miss rate of 100%?

array1 = malloc(DCACHE_SIZE);
array2 = malloc(DCACHE_SIZE);
while (condition) {
    memcpy(&array1[index], &array2[index], sizeof(char));
    index++;
}

Currently, my program makes 420,782,149 calls to memcpy per second. I think there is something seriously wrong with this number (it's hitting the cache a LOT).

How do I avoid cache?

Upvotes: 2

Views: 2330

Answers (4)

wildplasser

Reputation: 44250

A simple way to force cache misses is to hop between areas that are guaranteed to be in different cache windows, like:

#include <stdlib.h>
#include <string.h>
#define DCACHE_SIZE (1024*1024*8)

void dummy(void) {
    char *array1, *array2;
    size_t index, count;

    array1 = malloc(5 * (size_t)DCACHE_SIZE);
    array2 = malloc(5 * (size_t)DCACHE_SIZE);
    for (index = 0, count = 54321; count--;
         index = (index + 3*DCACHE_SIZE) % (5 * (size_t)DCACHE_SIZE)) {
        memcpy(&array1[index], &array2[index], 1);
    }
}

The 3 and 5 above were chosen arbitrarily (but they should be relatively prime); 1 and 2 would also suffice to jump out of the cache on each iteration. Note too that the source and destination of each memcpy() land in different cache slots, so every pass through the loop should cause two cache misses. BTW: on my machine, GCC replaces the memcpy() call with inlined instructions.

Upvotes: 1

Leeor

Reputation: 19736

Disabling the caches as suggested above is quite complicated; instead, you could use data manipulation methods that avoid them altogether.

The best way would be to define an uncacheable memory region; that way every read/write would go straight to memory and skip filling the cache, but that too would require tweaking your program at a more advanced level.

The simplest solution I can think of is to use streaming/non-temporal instructions that skip the cache directly. Try the _mm_stream_si64 / _mm_stream_si32 intrinsics if your compiler recognizes them, or use the movnt* assembly instruction family directly in an inline assembly section; they should have almost the same effect on your processor. Note that they manipulate larger elements than a single byte, so you might have to rearrange your code a bit.

Upvotes: 1

Albert S

Reputation: 2602

I would also turn off prefetching if you are not disabling the caches.
Furthermore, run your test in a loop at least 10 times and note the results.
Destroy and recreate the arrays inside the loop, and compare the timings against allocating the arrays once before the loop.

On your 420M result: that's about 420 MB/s copied (read and written). Depending on your RAM speed, that seems low.
You can also take a look at the STREAM benchmark from the University of Virginia for comparison.

Upvotes: 0

slowjelj

Reputation: 702

I am not familiar specifically with the Intel i5 caching architecture, but there are two basic approaches that should work on most processors:

  1. Disable L1/L2/L3 cache for your memory buffers. This is likely the only true way to ensure the caches are not used. A variation of this is to lock the contents of some other unused memory area into the caches (i.e. if disabling is not an option).
  2. If the first approach is not an option, make your arrays much, much larger than your DCACHE size and memcpy() over that region. The idea here is that the caches will be used but will be flushed as new portions of the large arrays are pulled into cache. This should give a benchmark pretty close to what you would get going directly from CPU to memory. If you use memset() instead of memcpy() and your caches are write-through, this benchmark should be identical to the direct CPU-to-memory path.

In both of those cases, for more precise results you should ensure the contents of array1[] and array2[] are not already in cache before starting the test. This might require allocating and filling (or simply reading) a third buffer just prior to the memcpy() test. There are many of these kinds of gotchas when trying to avoid the caches, and how to work around them is specific to the cache architecture and to how the caches are configured by your OS (e.g. if it is Linux, by default it probably won't configure the caches as write-through).

BTW, you do realize you are testing both memory reads and writes with your memcpy() approach? That is fine, but it might produce more unreliable results; a better approach might be to test reads and writes separately and not bother with functions like memset() and memcpy().

Upvotes: 1
