Optimal buffer size to avoid cache misses for recent i7 / i9 CPUs

Let's assume an algorithm is repeatedly processing buffers of data. It may be accessing, say, 2 to 16 of these buffers, all of the same size. What would you expect the optimum size of these buffers to be, assuming the algorithm can process the full data in smaller blocks?

I expect cache misses to become a potential bottleneck if the blocks are too big, but of course the bigger the blocks, the better for vectorization.

Let's assume current i7/i9 CPUs (2018).

Any ideas?

Upvotes: 0

Views: 290

Answers (1)

Peter Cordes

Reputation: 365517

Do you have multiple threads? Can you arrange things so the same thread uses the same buffer repeatedly? (i.e. keep buffers associated with threads when possible).
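
A minimal sketch of that thread-buffer affinity idea (process_block is a hypothetical stand-in for the real workload; the thread count and buffer size here are arbitrary): each worker owns one buffer, allocated once and reused across iterations, so repeated passes hit that core's private caches.

    #include <pthread.h>
    #include <stdlib.h>
    #include <string.h>

    #define NTHREADS   4
    #define BUF_BYTES  (32 * 1024)   /* per-thread buffer */
    #define ITERS      1000

    /* Stand-in for the real workload: any pass over the buffer. */
    static void process_block(float *buf, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            buf[i] = buf[i] * 2.0f + 1.0f;
    }

    static void *worker(void *arg)
    {
        (void)arg;
        /* One buffer per thread, allocated once and reused every
           iteration: repeated passes hit this core's private L1d/L2
           instead of bouncing the data between cores. */
        float *buf = aligned_alloc(64, BUF_BYTES);
        memset(buf, 0, BUF_BYTES);
        for (int i = 0; i < ITERS; i++)
            process_block(buf, BUF_BYTES / sizeof(float));
        free(buf);
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        for (int t = 0; t < NTHREADS; t++)
            pthread_create(&tid[t], NULL, worker, NULL);
        for (int t = 0; t < NTHREADS; t++)
            pthread_join(tid[t], NULL);
        return 0;
    }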

Modern Intel CPUs have 32 KiB L1d and 256 KiB L2, private per-core. (Or Skylake-AVX512 has 1 MiB private L2 caches, with a smaller shared L3.) (Which cache mapping technique is used in intel core i7 processor?)

Aiming for L2 hits most of the time is good. L2 miss / L3 hit some of the time isn't always terrible, but off-core is significantly slower. Remember that L2 is a unified cache, so it covers code as well, and of course there's stack memory and random other demands for L2. So aiming for a total buffer size of around half L2 size usually gives a good hit-rate for cache-blocking.
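
As a rough worked example of that rule of thumb (a sketch, assuming the 256 KiB client L2 above and an arbitrary choice of 4 buffers):

    #include <stdio.h>

    int main(void)
    {
        /* Rule of thumb: total buffer footprint ~ half the private L2,
           leaving the other half for code, stack, and other data. */
        const size_t l2_bytes   = 256 * 1024;  /* per-core L2 (client chips) */
        const size_t n_buffers  = 4;           /* "2 to 16" in the question */
        const size_t per_buffer = (l2_bytes / 2) / n_buffers;

        printf("per-buffer size: %zu KiB\n", per_buffer / 1024);  /* 32 KiB */
        return 0;
    }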

Depending on how much bandwidth your algorithm can use, you might even aim for mostly L1d hits, but small buffers can mean more startup / cleanup overhead and more time spent outside the main loop.


Also remember that with Hyperthreading, each logical core competes for cache on the physical core it's running on. So if two threads end up on the same physical core, but are touching totally different memory, your effective cache sizes are about half.


You should probably make the buffer size a tunable parameter, and profile with a few different sizes.
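
A minimal benchmark sketch for that sweep (process_block is again a hypothetical stand-in; pick candidate sizes that bracket your L1d, L2, and L3 capacities):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    /* Stand-in workload: replace with the real per-block processing. */
    static void process_block(float *buf, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            buf[i] = buf[i] * 2.0f + 1.0f;
    }

    int main(void)
    {
        const size_t total_floats = 64 * 1024 * 1024;  /* fixed total work */
        /* Candidate buffer sizes: 8 KiB up to 2 MiB, doubling each step. */
        for (size_t bytes = 8 * 1024; bytes <= 2 * 1024 * 1024; bytes *= 2) {
            size_t n = bytes / sizeof(float);
            float *buf = aligned_alloc(64, bytes);
            memset(buf, 0, bytes);

            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            for (size_t done = 0; done < total_floats; done += n)
                process_block(buf, n);
            clock_gettime(CLOCK_MONOTONIC, &t1);

            double sec = (t1.tv_sec - t0.tv_sec)
                       + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
            printf("%7zu KiB: %.3f s\n", bytes / 1024, sec);
            free(buf);
        }
        return 0;
    }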

Use perf counters to check whether you're actually avoiding L1d or L2 misses with different sizes, to help you understand whether your code is sensitive to different amounts of memory latency.
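
For example, with Linux perf (these are generic event aliases; the exact hardware events vary by microarchitecture, and perf list shows what your CPU exposes; ./your_program stands in for your benchmark binary):

    perf stat -e cycles,instructions,L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses ./your_program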

Upvotes: 1
