Reputation: 64955
Does clflush¹ also flush associated TLB entries? I would assume not, since clflush operates at cache-line granularity while TLB entries exist at the (much larger) page granularity - but I am prepared to be surprised.

¹ ... or clflushopt, although one would reasonably assume their behaviors are the same.
Upvotes: 4
Views: 1253
Reputation: 23669
The dTLB-loads-misses:u performance event can be used to determine whether clflush flushes the TLB entry that maps the specified cache line. This event occurs when a load misses in all TLB levels and causes a page walk. It's also more widely supported than dTLB-stores-misses:u. In particular, dTLB-loads-misses:u is supported on the Intel P4 and later (except Goldmont) and on AMD K7 and later.
You can find the code at https://godbolt.org/z/97XkkF. It takes two parameters:

- argv[1], which specifies whether all lines of the specified 4KB page should be flushed or only a single cache line.
- argv[2], which specifies whether to use clflush or clflushopt.

The test is simple. It allocates a single 4KB page and accesses the same location a large number of times using a load instruction. Before every access, however, a cache flushing operation is performed as specified by argv[1] and argv[2]. If the flush caused the TLB entry to be evicted, then a dTLB-loads-misses:u event will occur. If the number of events is anywhere close to the number of loads, then we may suspect that the flush had an impact on the TLB.
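The actual program is at the Godbolt link above; the following is only a rough sketch of the same experiment (the iteration count, variable names, and use of the clflush/clflushopt intrinsics are illustrative, not the exact code from the link):

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>
    #include <x86intrin.h>

    #define LINE  64
    #define PAGE  4096
    #define ITERS 1000000L

    int main(int argc, char **argv)
    {
        int wholePage = atoi(argv[1]);  /* 1 = flush all lines of the page, 0 = one line */
        int opt       = atoi(argv[2]);  /* 1 = clflushopt, 0 = clflush */

        char *page = aligned_alloc(PAGE, PAGE);
        memset(page, 1, PAGE);          /* fault the page in before measuring */

        volatile char sink;
        for (long i = 0; i < ITERS; i++) {
            size_t lines = wholePage ? PAGE / LINE : 1;
            for (size_t l = 0; l < lines; l++) {
                if (opt)
                    _mm_clflushopt(page + l * LINE);
                else
                    _mm_clflush(page + l * LINE);
            }
            _mm_mfence();               /* ensure the flush completes before the load */
            sink = page[0];             /* the load whose DTLB behavior we measure */
        }
        return 0;
    }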
Use the following commands to compile and run the code:
gcc -mclflushopt -O3 main.c
perf stat -e dTLB-loads-misses:u ./a.out wholePage opt
where wholePage and opt can each be 0 or 1, so there are 4 cases to test.
I've run the test on SNB, IVB, HSW, BDW, and CFL. On all processors and in all cases, the number of events is negligible. You can run the test on other processors.
I've also managed to run a test for WBINVD by calling ioctl on a kernel module in the loop to execute the instruction in kernel mode. I've measured dTLB-loads-misses:u, iTLB-loads-misses:u, and icache_64b.iftag_miss:u. All of them are negligible (under 0.004% of 1 million load instructions). This means that WBINVD does not flush the DTLB, ITLB, or the instruction cache; it only flushes the data caches.
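For reference, the kernel-module side of such a WBINVD test can be tiny. Here is a sketch of a misc character device whose ioctl simply runs the instruction in ring 0; the device name and the absence of any command decoding are illustrative simplifications, not necessarily how the actual module was written:

    #include <linux/module.h>
    #include <linux/miscdevice.h>
    #include <linux/fs.h>

    /* Every ioctl on the device executes WBINVD in kernel mode on behalf
     * of the user-space measurement loop. */
    static long wbinvd_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
    {
        asm volatile("wbinvd" ::: "memory");
        return 0;
    }

    static const struct file_operations wbinvd_fops = {
        .owner          = THIS_MODULE,
        .unlocked_ioctl = wbinvd_ioctl,
    };

    static struct miscdevice wbinvd_dev = {
        .minor = MISC_DYNAMIC_MINOR,
        .name  = "wbinvd_test",              /* appears as /dev/wbinvd_test */
        .fops  = &wbinvd_fops,
    };

    static int __init wbinvd_init(void)  { return misc_register(&wbinvd_dev); }
    static void __exit wbinvd_exit(void) { misc_deregister(&wbinvd_dev); }

    module_init(wbinvd_init);
    module_exit(wbinvd_exit);
    MODULE_LICENSE("GPL");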
Upvotes: 3
Reputation: 364448
I think it's safe to assume no; baking invlpg
into clflush
sounds like an insane design decision that I don't think anyone would make. You often want to invalidate multiple lines in a page. There's also no apparent benefit; flushing the TLB as well doesn't make it any easier to implement data-cache flushing.
Even just dropping the final TLB entry (without necessarily invalidating any page-directory caching) would be weaker than invlpg, but it still wouldn't make sense.
All modern x86s use caches with physical indexing/tagging, not virtual. (VIPT L1d caches are really PIPT with free translation of the index, because it's taken from address bits that are part of the offset within a page.) And even if caches were virtual, invalidating TLB entries would require invalidating virtual caches, but not the other way around.
According to IACA, clflush
is only 2 uops on HSW-SKL, and 4 uops (including micro-fusion) on NHM-IVB. So it's not even microcoded on Intel.
IACA doesn't model invlpg
, but I assume it's more uops. (And it's privileged so it's not totally trivial to test.) It's remotely possible those extra uops on pre-HSW were for TLB invalidation.
I don't have any info on AMD.
The fact that invlpg
is privileged is another reason to expect clflush
not to be a superset of it; clflush is unprivileged. Presumably it's only for performance reasons that invlpg is restricted to ring 0.
But invlpg
won't page-fault, so user-space could use it to invalidate kernel TLB entries, delaying real-time processes and interrupt handlers. (wbinvd
is privileged for similar reasons: it's very slow and I think not interruptible.) clflush
does fault on illegal addresses so it wouldn't open up that denial-of-service vulnerability. You could clflush
the shared VDSO page, though.
Unless there's some reason why a CPU would want to expose invlpg
in user-space (by baking it into clflush), I really don't see why any vendor would do it.
With non-volatile DIMMs in the future of computing, it's even less likely that any future CPUs will make it super-slow to loop over a range of memory doing clflush
. You'd expect most software using memory-mapped NV storage to be using clflushopt, but I'd expect CPU vendors to make clflush as fast as possible, too.
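To make the range-flush concrete, that loop is just one flush per cache line over the buffer. A generic sketch (the function name is hypothetical; it uses clflushopt with a trailing sfence since clflushopt is weakly ordered, and real persistent-memory code would often prefer clwb where available):

    #include <stddef.h>
    #include <stdint.h>
    #include <x86intrin.h>      /* compile with -mclflushopt */

    /* Flush every cache line covering [buf, buf + len) back to memory. */
    static void flush_range(const void *buf, size_t len)
    {
        uintptr_t p   = (uintptr_t)buf & ~(uintptr_t)63;   /* align down to a line start */
        uintptr_t end = (uintptr_t)buf + len;
        for (; p < end; p += 64)
            _mm_clflushopt((void *)p);
        _mm_sfence();           /* order the weakly-ordered flushes before later stores */
    }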
Upvotes: 4