BeeOnRope

Reputation: 64955

Does clflush also remove TLB entries?

Does clflush[1] also flush the associated TLB entries? I would assume not, since clflush operates at cache-line granularity while TLB entries exist at the (much larger) page granularity, but I am prepared to be surprised.


[1] ... or clflushopt, although one would reasonably assume their behaviors are the same.

Upvotes: 4

Views: 1253

Answers (2)

Hadi Brais

Reputation: 23669

The dTLB-loads-misses:u performance event can be used to determine whether clflush flushes the TLB entry that maps the specified cache line. This event occurs when a load misses in all TLB levels and causes a page walk. It's also more widely supported than dTLB-stores-misses:u. In particular, dTLB-loads-misses:u is supported on the Intel P4 and later (except Goldmont) and on AMD K7 and later.

You can find the code at https://godbolt.org/z/97XkkF. It takes two parameters:

  • argv[1], which specifies whether all lines of the specified 4KB page should be flushed or only a single cache line.
  • argv[2], which specifies whether to use clflush or clflushopt.

The test is simple. It allocates a single 4KB page and accesses the same location a large number of times using a load instruction. Before every access, however, a cache flushing operation is performed as specified by argv[1] and argv[2]. If the flush caused the TLB entry to be evicted, then a dTLB-loads-misses:u event will occur. If the number of events is anywhere close to the number of loads, then we may suspect that the flush had an impact on the TLB.

Use the following commands to compile and run the code:

gcc -mclflushopt -O3 main.c
perf stat -e dTLB-loads-misses:u ./a.out wholePage opt

where wholePage and opt can be 0 or 1. So there are 4 cases to test.

I've run the test on SNB, IVB, HSW, BDW, and CFL. On all processors and in all four cases, the number of events is negligible compared to the number of loads, so neither clflush nor clflushopt appears to evict the TLB entry. You can run the test on other processors.
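For readers who can't open the link, here is a minimal sketch of the loop described above. It is not the code from the Godbolt link: the names run_test, flush_lines, PAGE_BYTES, and LINE_BYTES are hypothetical, and the clflushopt path is only compiled in when the compiler is given -mclflushopt (GCC then defines __CLFLUSHOPT__).

```c
#include <immintrin.h>  /* _mm_clflush, _mm_clflushopt (x86 only) */
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

enum { PAGE_BYTES = 4096, LINE_BYTES = 64 };  /* assumed 4KB page, 64-byte lines */

/* Flush either the first cache line of the page or every line in it. */
static void flush_lines(char *page, int whole_page, int use_opt)
{
    size_t end = whole_page ? PAGE_BYTES : LINE_BYTES;
    for (size_t off = 0; off < end; off += LINE_BYTES) {
#ifdef __CLFLUSHOPT__
        if (use_opt) {
            _mm_clflushopt(page + off);
            continue;
        }
#endif
        (void)use_opt;            /* falls back to clflush without -mclflushopt */
        _mm_clflush(page + off);
    }
}

/* Flush, then load from the same location, `iterations` times.
   Returns the sum of the loaded bytes so the loads cannot be optimized away. */
uint64_t run_test(long iterations, int whole_page, int use_opt)
{
    char *page = aligned_alloc(PAGE_BYTES, PAGE_BYTES);
    if (!page)
        return 0;
    page[0] = 1;                  /* touch the page so it is mapped */

    uint64_t sum = 0;
    for (long i = 0; i < iterations; i++) {
        flush_lines(page, whole_page, use_opt);
        sum += *(volatile char *)page;   /* the load whose dTLB misses are counted */
    }
    free(page);
    return sum;
}
```

A main that parses argv[1] and argv[2] into whole_page and use_opt and calls run_test with a large iteration count would reproduce the four cases; the binary is then run under perf stat as shown above.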


I've also managed to run a test for WBINVD by calling ioctl in the loop to a kernel module that executes the instruction in kernel mode. I've measured dTLB-loads-misses:u, iTLB-loads-misses:u, and icache_64b.iftag_miss:u. All of them are negligible (under 0.004% of 1 million load instructions). This means that WBINVD does not flush the DTLB, ITLB, or the instruction cache. It only flushes the data caches.

Upvotes: 3

Peter Cordes

Reputation: 364448

I think it's safe to assume no; baking invlpg into clflush sounds like an insane design decision that I don't think anyone would make. You often want to invalidate multiple lines in a page. There's also no apparent benefit; flushing the TLB as well doesn't make it any easier to implement data-cache flushing.

Even just dropping the final TLB entry (without necessarily invalidating any page-directory caching) would be weaker than invlpg but still not make sense.

All modern x86s use caches with physical indexing/tagging, not virtual. (VIPT L1d caches are really PIPT with free translation of the index because it's taken from address bits that are part of the offset within a page.) And even if caches were virtual, invalidating TLB entries requires invalidating virtual caches but not the other way around.
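To make the "free translation of the index" point concrete, here's a small sketch of the arithmetic. The function name index_fits_in_page_offset is made up for illustration, and the 32 KiB / 8-way / 64-byte-line geometry is just a typical Intel L1d configuration used as an example. The index bits need no translation when the line-offset bits plus the set-index bits fit inside the page offset.

```c
#include <stdbool.h>
#include <stddef.h>

/* log2 of n, where n is assumed to be a power of two. */
static unsigned log2_pow2(size_t n)
{
    unsigned bits = 0;
    while (n > 1) {
        n >>= 1;
        bits++;
    }
    return bits;
}

/* True if the set index comes entirely from page-offset bits, so a
   virtually indexed cache behaves exactly like a physically indexed one. */
bool index_fits_in_page_offset(size_t cache_bytes, size_t ways,
                               size_t line_bytes, size_t page_bytes)
{
    size_t sets = cache_bytes / (ways * line_bytes);
    unsigned low_bits = log2_pow2(line_bytes) + log2_pow2(sets);
    return low_bits <= log2_pow2(page_bytes);
}
```

With 32 KiB, 8 ways, and 64-byte lines there are 64 sets, so the index uses address bits 6 to 11, all of which lie within the 12-bit offset of a 4KB page. Doubling the size to 64 KiB at the same associativity would need bit 12, which does depend on translation.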


According to IACA, clflush is only 2 uops on HSW-SKL, and 4 uops (including micro-fusion) on NHM-IVB. So it's not even micro-coded on Intel.

IACA doesn't model invlpg, but I assume it's more uops. (And it's privileged so it's not totally trivial to test.) It's remotely possible those extra uops on pre-HSW were for TLB invalidation.

I don't have any info on AMD.


The fact that invlpg is privileged is another reason to expect clflush not to be a superset of it. clflush is unprivileged. Presumably it's only for performance reasons that invlpg is restricted to ring 0 only.

But invlpg won't page-fault, so user-space could use it to invalidate kernel TLB entries, delaying real-time processes and interrupt handlers. (wbinvd is privileged for similar reasons: it's very slow and I think not interruptible.) clflush does fault on illegal addresses so it wouldn't open up that denial-of-service vulnerability. You could clflush the shared VDSO page, though.

Unless there's some reason why a CPU would want to expose invlpg in user-space (by baking it into clflush), I really don't see why any vendor would do it.


With non-volatile DIMMs in the future of computing, it's even less likely that any future CPUs will make it super-slow to loop over a range of memory doing clflush. You'd expect most software using memory mapped NV storage to be using clflushopt, but I'd expect CPU vendors to make clflush as fast as possible, too.
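For illustration, looping over a range of memory with clflush might look like the sketch below. The helper name flush_range and the 64-byte FLUSH_LINE constant are assumptions, not part of any library. The trailing sfence isn't needed for plain clflush, but it would be if the loop body were swapped for the weakly ordered clflushopt.

```c
#include <immintrin.h>  /* _mm_clflush, _mm_sfence (x86 only) */
#include <stddef.h>
#include <stdint.h>

enum { FLUSH_LINE = 64 };  /* assumed cache-line size */

/* Flush every cache line that overlaps [addr, addr + len).
   Returns the number of lines flushed. */
size_t flush_range(const void *addr, size_t len)
{
    if (len == 0)
        return 0;

    /* Round the start down to a line boundary. */
    uintptr_t p = (uintptr_t)addr & ~(uintptr_t)(FLUSH_LINE - 1);
    uintptr_t end = (uintptr_t)addr + len;
    size_t lines = 0;

    for (; p < end; p += FLUSH_LINE) {
        _mm_clflush((const void *)p);
        lines++;
    }
    _mm_sfence();  /* only required if clflushopt is used instead */
    return lines;
}
```

If clflush also dropped the TLB entry, every iteration after the first within a page would pay for a fresh page walk, which is the slowdown described above.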

Upvotes: 4
