Richard

Reputation: 15562

Slowdown when accessing data at page boundaries?

(My question is about computer architecture and performance. I didn't find a more relevant forum, so I'm posting it here as a general question.)

I have a C program which accesses memory words that are located X bytes apart in the virtual address space. For instance: for (int i = 0; <some stop condition>; i += X) { array[i] = 4; }.
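
For reference, here is a minimal self-contained sketch of this kind of benchmark (the buffer size, repeat count, and timing method here are my assumptions, not the exact code I ran):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define ARRAY_BYTES (64UL * 1024 * 1024)  /* 64 MiB buffer (assumed size) */
    #define REPS        100                   /* repeat to get measurable times */

    int main(int argc, char **argv)
    {
        size_t X = (argc > 1) ? strtoul(argv[1], NULL, 0) : 4096;  /* stride in bytes */
        char *array = malloc(ARRAY_BYTES);
        if (!array)
            return 1;

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int rep = 0; rep < REPS; rep++)
            for (size_t i = 0; i < ARRAY_BYTES; i += X)  /* one store every X bytes */
                array[i] = 4;
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ms = (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
        printf("X = %zu: %.3f ms\n", X, ms);
        free(array);
        return 0;
    }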

I measure the execution time with varying values of X. Interestingly, when X is a power of 2 near the page size, e.g., X = 1024, 2048, 4096, 8192, ..., I get a huge performance slowdown. But for all other values of X, like 1023 and 1025, there is no slowdown. The performance results are shown in the figure below.

(Figure: the X-axis is the value of X and the Y-axis is execution time in milliseconds.)

I tested my program on several personal machines, all running Linux on x86_64 Intel CPUs.

What could be the cause of this slowdown? We have considered DRAM row-buffer conflicts, the L3 cache, etc., but none of these seem to explain it...

Update (July 11)

We did a small test by adding NOP instructions to the original code (sketched below), and the slowdown is still there. This more or less rules out 4k aliasing; conflict cache misses are the more likely cause.
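
For context, here is a sketch of how the NOP padding can be inserted with GCC/Clang inline asm (the exact NOP count and placement in our test may differ):

    for (size_t i = 0; i < ARRAY_BYTES; i += X) {
        array[i] = 4;
        asm volatile("nop\n\tnop\n\tnop\n\tnop");  /* pad the loop body with NOPs */
    }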

Upvotes: 1

Views: 347

Answers (1)

Peter Cordes

Reputation: 364448

There are two things here:

  • Set-associative cache aliasing creating conflict misses if you only touch the multiple-of-4096 addresses. The inner fast caches (L1 and L2) are normally indexed by a small range of bits from the physical address, so striding by 4096 bytes means those address bits are the same for every access: you're only using one of the sets in L1d cache, and some small number in L2.

    Striding by 1024 means you'd only be using 4 sets in L1d, with smaller powers of 2 using progressively more sets, but a non-power-of-2 stride distributing over all the sets. (Intel CPUs have used 32KiB 8-way associative L1d caches for a long time; 32K/8 = 4K per way. Ice Lake bumped that up to 48K 12-way, keeping the same indexing where the set depends only on address bits below the page number. That's not a coincidence for VIPT caches, which want to index in parallel with the TLB lookup.)

    But with a non-power-of-2 stride, your accesses will be distributed over more sets in the cache; the sketch after this list illustrates the set-index math. See also Performance advantages of powers-of-2 sized data? (the answer there describes this disadvantage)

    Which cache mapping technique is used in intel core i7 processor? - shared L3 cache is resistant to aliasing from big power-of-2 offsets because it uses a more complex indexing function.

  • 4k aliasing (e.g. in some Intel CPUs), although with only stores this probably doesn't matter. It's mainly a factor for memory disambiguation, when the CPU has to quickly figure out whether a load might be reloading recently-stored data, and it does so in the first pass by looking just at the page-offset bits.

    This is probably not what's going on for you, but for more details see:
    L1 memory bandwidth: 50% drop in efficiency using addresses which differ by 4096+64 bytes and
    Why are elementwise additions much faster in separate loops than in a combined loop?
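
To make the first point concrete, here's a small sketch (my own illustration, assuming a 32 KiB, 8-way L1d with 64-byte lines, i.e. 64 sets selected by address bits [11:6]) that counts how many distinct L1d sets various strides touch:

    #include <stdio.h>

    /* Set index for an assumed 32 KiB, 8-way L1d with 64-byte lines:
       32768 / (8 * 64) = 64 sets, selected by address bits [11:6]. */
    static unsigned l1d_set(unsigned long addr)
    {
        return (addr >> 6) & 63;
    }

    int main(void)
    {
        unsigned long strides[] = {4096, 2048, 1024, 1023, 1025};
        for (int s = 0; s < 5; s++) {
            int used[64] = {0};
            /* Count the distinct sets touched by 4096 strided accesses. */
            for (unsigned long i = 0; i < 4096; i++)
                used[l1d_set(i * strides[s])] = 1;
            int n = 0;
            for (int j = 0; j < 64; j++)
                n += used[j];
            printf("stride %4lu -> %2d of 64 L1d sets used\n", strides[s], n);
        }
        return 0;
    }

With a 4096-byte stride every access maps to the same set, so at most 8 lines can be cached at once; 2048 uses 2 sets and 1024 uses 4, while 1023 and 1025 eventually cycle through all 64.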

Either or both of these effects could be a factor in Why is there huge performance hit in 2048x2048 versus 2047x2047 array multiplication?


Another possible factor is that HW prefetching stops at physical page boundaries (see Why does the speed of memcpy() drop dramatically every 4KB?). But changing the stride from 1024 to 1023 wouldn't help that by a big factor. "Next-page" prefetching in Ivy Bridge and later is only TLB prefetching, not prefetching of data from the next page.


I kind of assumed x86 for most of this answer, but the cache-aliasing / conflict-miss issues apply generally. Set-associative caches with simple indexing are universally used for L1d caches (or, on older CPUs, direct-mapped, where each "set" only has one member). The 4k aliasing issue might be mostly Intel-specific.

Prefetchers not crossing virtual page boundaries is likely also a general limitation, not specific to Intel.

Upvotes: 2
