Reputation: 33817
I want to make a simple test to see performance differences with and without cache misses.
I want to see that when operating on array X (X fits the cache) the performance is much better than with array Y (Y does not fit the cache). Actually, I want to spot the critical array size at which cache misses start to impact performance.
I made a simple function that accesses an array in a loop. I should get one performance for an arr_size that fits the cache and a different one for an arr_size that does not fit the cache. But I get more or less constant performance independently of arr_size, even for big sizes (like 20 MB). Why is that?
// compiled without optimizations -O0
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

long get_nsec();

float benchmark_cache(const size_t arr_size)
{
    unsigned char* arr_a = (unsigned char*) malloc(sizeof(char) * arr_size);
    unsigned char* arr_b = (unsigned char*) malloc(sizeof(char) * arr_size);
    assert( arr_a );
    assert( arr_b );

    long time0 = get_nsec();
    for( size_t i = 0; i < arr_size; ++i ) {
        // index k will jump forth and back, to generate cache misses
        size_t k = (i / 2) + (i % 2) * arr_size / 2;
        arr_b[k] = arr_a[k] + 1;
    }
    long time_d = get_nsec() - time0;

    float performance = float(time_d) / arr_size;  // ns per element
    printf("perf %.1f @ %zu kB\n", performance, arr_size / 1024);

    free(arr_a);
    free(arr_b);
    return performance;
}

long get_nsec()
{
    timespec ts;
    clock_gettime(CLOCK_REALTIME, &ts);
    // seconds must be scaled by 1e9, not 1e6, to get nanoseconds
    return long(ts.tv_sec) * 1000000000L + ts.tv_nsec;
}
Upvotes: 0
Views: 3257
Reputation: 1156
I've just read "what should I know about memory" and played around with the benchmarking sample. Hope this helps someone:
#include <algorithm>
#include <cstdint>
#include <ctime>
#include <functional>
#include <iostream>
#include <random>
#include <vector>

// RAII timer: prints the elapsed CPU time of the enclosing block on destruction
struct TimeLogger
{
    const char* m_blockName;
    const clock_t m_start;

    TimeLogger(const char* blockName) : m_blockName(blockName), m_start(clock()) {}
    ~TimeLogger()
    {
        clock_t finish = clock();
        std::cout << "Done: " << m_blockName << " in "
                  << (finish - m_start) * 1000.0 / CLOCKS_PER_SEC << " ms" << std::endl;
    }
};
const size_t k_ITERATIONS = 16;
const size_t k_SIZE = 1024 * 1024 * 16;

// sums data[] in the order given by indexes[]; the access pattern
// is determined entirely by the index vector
uint64_t test(const char* name, const std::vector<int64_t>& data, const std::vector<size_t>& indexes)
{
    TimeLogger log = name;
    uint64_t sum = 0;
    for (size_t i = 0; i < k_ITERATIONS; ++i)
        for (size_t index : indexes)
            sum += data[index];
    return sum;
}
// returns a shuffled sequence of runs of consecutive numbers like [0,1,2, 6,7,8, 3,4,5, ...];
// the engine is taken by reference so each call advances its state
std::vector<size_t> fillSequences(size_t size, size_t seriesSize, std::mt19937& g)
{
    std::vector<size_t> semiRandIdx;
    semiRandIdx.reserve(size);
    size_t i = 0;
    auto semiRandSequences = std::vector<size_t>(size / seriesSize, 0);
    std::generate(semiRandSequences.begin(), semiRandSequences.end(), [&i]() { return i++; });
    std::shuffle(semiRandSequences.begin(), semiRandSequences.end(), g);
    for (size_t seqNumber : semiRandSequences)
        for (size_t j = seqNumber * seriesSize; j < (seqNumber + 1) * seriesSize; ++j)
            semiRandIdx.push_back(j);
    return semiRandIdx;
}
int main()
{
    std::random_device rd;
    std::mt19937 g(rd());

    auto intData = std::vector<int64_t>(k_SIZE, 0);
    // std::ref keeps the engine's state advancing instead of copying it
    std::generate(intData.begin(), intData.end(), std::ref(g));

    // [0, 1, 2, ... N]
    auto idx = std::vector<size_t>(k_SIZE, 0);
    std::generate(idx.begin(), idx.end(), []() { static size_t i = 0; return i++; });
    // [N, N-1, ... 0]
    auto reverseIdx = std::vector<size_t>(idx.rbegin(), idx.rend());
    // random permutation of [0, 1, ... N]
    auto randIdx = idx;
    std::shuffle(randIdx.begin(), randIdx.end(), g);
    // random permutations of 32-, 64- and 128-byte sequences
    auto seq32Idx = fillSequences(idx.size(), 32 / sizeof(int64_t), g);
    auto seq64Idx = fillSequences(idx.size(), 64 / sizeof(int64_t), g);
    auto seq128Idx = fillSequences(idx.size(), 128 / sizeof(int64_t), g);

    size_t dataSize = intData.size() * sizeof(int64_t);
    size_t indexSize = idx.size() * sizeof(int64_t);
    std::cout << "vectors filled, data (MB): " << dataSize / 1024 / 1024.0
              << "; index (MB): " << indexSize / 1024 / 1024.0
              << "; total (MB): " << (dataSize + indexSize) / 1024 / 1024.0 << std::endl
              << "Loops: " << k_ITERATIONS << std::endl;

    uint64_t sum1 = test("regular access", intData, idx);
    uint64_t sum2 = test("reverse access", intData, reverseIdx);
    uint64_t sum3 = test("random access", intData, randIdx);
    uint64_t sum4 = test("random 32-byte sequences", intData, seq32Idx);
    uint64_t sum5 = test("random 64-byte sequences", intData, seq64Idx);
    uint64_t sum6 = test("random 128-byte sequences", intData, seq128Idx);

    // print the sums so the compiler cannot optimize the reads away
    std::cout << sum1 << ", " << sum2 << ", " << sum3 << ", "
              << sum4 << ", " << sum5 << ", " << sum6 << std::endl;
    return 0;
}
A curious thing is that the CPU's prefetcher handles reverse array access just as well as forward access. I found this when I compared forward-access time with reverse-access time: on my PC the performance is the same.
Here are also some results from a laptop with 2x32 KB L1, 2x256 KB L2 and 3 MB L3 cache:
vectors filled, data (MB): 512; index (MB): 512; total (MB): 1024
Loops: 1
Done: regular access in 147 ms
Done: reverse access in 119 ms
Done: random access in 2943 ms
Done: random 32-byte sequences in 938 ms
Done: random 64-byte sequences in 618 ms
Done: random 128-byte sequences in 495 ms
...
vectors filled, data (MB): 4; index (MB): 4; total (MB): 8
Loops: 512
Done: regular access in 331 ms
Done: reverse access in 334 ms
Done: random access in 1961 ms
Done: random 32-byte sequences in 1099 ms
Done: random 64-byte sequences in 930 ms
Done: random 128-byte sequences in 824 ms
...
vectors filled, data (MB): 1; index (MB): 1; total (MB): 2
Loops: 2048
Done: regular access in 174 ms
Done: reverse access in 162 ms
Done: random access in 490 ms
Done: random 32-byte sequences in 318 ms
Done: random 64-byte sequences in 295 ms
Done: random 128-byte sequences in 257 ms
...
vectors filled, data (MB): 0.125; index (MB): 0.125; total (MB): 0.25
Loops: 16384
Done: regular access in 148 ms
Done: reverse access in 139 ms
Done: random access in 210 ms
Done: random 32-byte sequences in 179 ms
Done: random 64-byte sequences in 166 ms
Done: random 128-byte sequences in 163 ms
Upvotes: 0
Reputation: 52274
Cache performance improves when you access the same locations several times without touching too many other locations in between. Here you access your allocated memory only once, so you won't see much cache effect.
Even if you change your code to sweep the whole arrays several times, the cache-handling logic tries to predict your accesses and generally succeeds if the pattern is simple enough. A linear forward access, even split in two streams, is simple enough.
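For illustration, here is a minimal sketch of such a repeated sweep built on the asker's code (benchmark_cache_repeated and N_PASSES are names invented here; get_nsec is the asker's helper):

// Sketch: same loop as benchmark_cache, but the arrays are swept
// N_PASSES times, so when they fit in cache the later passes hit it.
float benchmark_cache_repeated(const size_t arr_size)
{
    const size_t N_PASSES = 100;
    unsigned char* arr_a = (unsigned char*) malloc(arr_size);
    unsigned char* arr_b = (unsigned char*) malloc(arr_size);
    assert(arr_a && arr_b);

    long time0 = get_nsec();
    for (size_t pass = 0; pass < N_PASSES; ++pass)
        for (size_t i = 0; i < arr_size; ++i) {
            size_t k = (i / 2) + (i % 2) * arr_size / 2;
            arr_b[k] = arr_a[k] + 1;
        }
    long time_d = get_nsec() - time0;

    float performance = float(time_d) / (arr_size * N_PASSES);  // ns per access
    free(arr_a);
    free(arr_b);
    return performance;
}

With this change, an arr_size that fits in cache should show a visibly lower per-access time, because every pass after the first is served from cache.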
Upvotes: 1
Reputation: 31851
It's hard to say exactly, but my guess is that the CPU's predictive, linear loading is helping you quite a bit. That is, since you access the data in order, the moment you hit an uncached value the CPU loads the next block of data. This loading can essentially be done in parallel with your computation, so you may never really be waiting on the load.
I know that you are trying to jump around, but the read/write order is still quite linear in nature: you simply iterate through two blocks instead of one. Try using a cheap random number generator to skip around even more, as in the sketch below.
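One way to do that, as a sketch reusing the asker's arrays (the constants are the well-known Numerical Recipes LCG parameters, not anything from the original post):

// replace the timed loop with a cheap LCG-driven random walk
// so the access pattern defeats the hardware prefetcher
// (needs <stdint.h> for uint32_t)
uint32_t state = 12345;                      // arbitrary seed
for (size_t i = 0; i < arr_size; ++i) {
    state = state * 1664525u + 1013904223u;  // one LCG step
    size_t k = state % arr_size;             // pseudo-random index
    arr_b[k] = arr_a[k] + 1;
}

If arr_size is a power of two, state & (arr_size - 1) avoids the modulo, which matters given the next point.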
Also note that % is a relatively slow operation, and thus you may unintentionally be measuring its cost instead. Since you compile without optimizations, the modulo operator probably really is used here rather than a mask. Try doing the test with full optimizations turned on as well; at -O0 you can also do the strength reduction by hand:
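For example, a sketch of the same index computation written with shift and mask (valid because i is unsigned):

const size_t half = arr_size / 2;  // hoisted out of the loop by hand
for (size_t i = 0; i < arr_size; ++i) {
    // (i >> 1) == i / 2 and (i & 1) == i % 2 for unsigned i
    size_t k = (i >> 1) + (i & 1) * half;
    arr_b[k] = arr_a[k] + 1;
}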
Plus, be sure to set your thread to a fixed CPU affinity with a real-time priority (how you do this depends on your OS; a Linux sketch follows). This should limit any context-switching overhead.
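On Linux, a hypothetical helper might look like this (a sketch; the real-time priority needs root or CAP_SYS_NICE, and error checking is omitted):

#define _GNU_SOURCE   // for CPU_ZERO/CPU_SET with glibc
#include <sched.h>

// pin the calling thread to CPU 0 and give it the lowest
// FIFO real-time priority
void pin_and_prioritize()
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);
    sched_setaffinity(0, sizeof(set), &set);  // pid 0 = calling thread

    sched_param sp;
    sp.sched_priority = 1;
    sched_setscheduler(0, SCHED_FIFO, &sp);   // pid 0 = calling thread
}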
Upvotes: 1
Reputation: 18864
You should probably be using a tool like cachegrind in cache simulation mode to get sensible results. Otherwise the measured cache performance is significantly affected by context switching resulting from scheduler work.
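For example, valgrind --tool=cachegrind ./benchmark (assuming your binary is called benchmark) runs the program on a simulated cache hierarchy and reports hit and miss counts, and cg_annotate on the produced cachegrind.out.&lt;pid&gt; file attributes the misses to individual source lines. Because the simulation is deterministic, scheduler noise does not distort the numbers.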
Upvotes: 0