Reputation: 21615
We know that caches use virtual addresses. So how does this work when multiple processes are involved? This applies especially to shared caches, such as a shared L2 cache, but also to a core-local L1 cache: when processes are switched, or with simultaneous multithreading (hyperthreading), you could have threads from two different processes running on the same physical core. Is hyperthreading any good when threads from different processes are involved, or can it only boost performance when threads of the same process are involved?
Upvotes: 7
Views: 1660
Reputation: 364220
None of the major x86 CPU microarchitectures use virtually-addressed caches. They all use virtually-indexed / physically-tagged (VIPT) L1 caches. VIPT is a performance hack that allows tags from a set to be fetched in parallel with a TLB lookup.
The bits of the address that are used as the index are the same in the physical and virtual addresses. (i.e. they're part of the offset within a 4k page, so they don't need to be translated by the TLB). This means that it effectively behaves exactly like a phys/phys (PIPT) cache, avoiding all problems of virtual addressing.
This is made possible by keeping the cache small and giving it enough ways. Intel's L1 caches are 32 KiB, 8-way associative, with 64 B lines: 32 KiB / 8 ways = 4 KiB per way, so the set index and line offset together are exactly the 12 within-a-page address bits. (See other resources for diagrams and more detailed explanations.)
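To make the bit arithmetic concrete, here is a minimal C sketch, assuming Intel's 32 KiB / 8-way / 64 B-line L1d geometry from above (the example address is arbitrary). It extracts the line offset and set index and shows that both come entirely from bits [11:0], i.e. the offset within a 4 KiB page, which the TLB never changes:

```c
#include <stdio.h>
#include <stdint.h>

int main(void) {
    const unsigned line_size  = 64;                              /* bytes per line */
    const unsigned ways       = 8;
    const unsigned cache_size = 32 * 1024;                       /* 32 KiB total   */
    const unsigned sets       = cache_size / (line_size * ways); /* = 64 sets      */

    uint64_t vaddr = 0x7ffd1234abcdULL;            /* arbitrary example address    */

    uint64_t offset = vaddr & (line_size - 1);     /* bits [5:0]                   */
    uint64_t index  = (vaddr / line_size) % sets;  /* bits [11:6]                  */

    /* index + offset use bits [11:0] only, i.e. the page offset.
     * Bits [63:12] become the physical tag after TLB translation. */
    printf("sets=%u  line offset=0x%llx  set index=%llu\n",
           sets, (unsigned long long)offset, (unsigned long long)index);
    return 0;
}
```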
Hyperthreading works fine with separate processes, because x86 caches avoid aliasing (synonym / homonym) problems: they behave like physically-addressed caches. Two memory-intensive processes that don't share any memory might still run slower with hyperthreading than without, though: competitive sharing of the caches can be worse than just running one process after the other finishes, if that's an option.
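To illustrate the competitive-sharing point, here is a toy C sketch (the buffer sizes and the 32 KiB L1d / 64 B line geometry are assumptions, not measurements): each thread streams over its own 24 KiB buffer, which fits in L1d alone but not when two hyperthreads share the same L1d. To actually observe the effect you'd pin both threads to sibling logical CPUs of one physical core, e.g. with taskset (which logical CPU numbers are siblings depends on the machine), and compare against running them on separate cores.

```c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

/* Each working set is 24 KiB: fits a 32 KiB L1d alone, but two of them
 * sharing one core's L1d evict each other's lines. */
#define BUF_BYTES (24 * 1024)
#define PASSES    100000

static void *walker(void *arg)
{
    volatile unsigned char *buf = arg;   /* volatile: keep the loads */
    unsigned long sum = 0;
    for (int p = 0; p < PASSES; p++)
        for (int i = 0; i < BUF_BYTES; i += 64)   /* one access per 64 B line */
            sum += buf[i];
    (void)sum;
    return NULL;
}

int main(void)                            /* build with: cc -O2 -pthread */
{
    unsigned char *a = calloc(1, BUF_BYTES);
    unsigned char *b = calloc(1, BUF_BYTES);
    pthread_t t1, t2;

    pthread_create(&t1, NULL, walker, a);
    pthread_create(&t2, NULL, walker, b);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    free(a);
    free(b);
    return 0;
}
```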
For processes that bottleneck on something other than a resource that hyperthreading shares, HT certainly helps: e.g. on branch mispredicts, or on cache misses from unpredictable accesses to a big working set that would miss often even without hyperthreading.
CPUs that use virt/virt (VIVT) caches do need to invalidate them on context switches, or carry extra tags to keep track of which address space (process) each line belongs to. This is similar to what TLBs do to support virtualization: entries are tagged with a VM ID (such as Intel's VPID), so translations from different guests don't get confused with each other. A virt/virt L1 means you don't need a fast TLB: translation is only needed on L1 misses, so in effect the L1 cache is also caching translations.
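As a toy sketch of what those "extra tags" mean (not modeled on any real CPU): a virtually-tagged line that also stores an address-space ID, so a hit requires matching the current process's ASID and a context switch doesn't force a full flush:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Toy model of one line of an ASID-tagged virtually-addressed (VIVT) cache. */
struct vivt_line {
    bool     valid;
    uint16_t asid;   /* address space (process) that filled the line */
    uint64_t vtag;   /* virtual tag bits                             */
};

static bool lookup_hits(const struct vivt_line *line,
                        uint16_t cur_asid, uint64_t vtag)
{
    /* Without the asid compare, a line left behind by the previous process
     * could wrongly hit for the same virtual address in a new process
     * (the homonym problem), forcing a flush on every context switch. */
    return line->valid && line->asid == cur_asid && line->vtag == vtag;
}

int main(void)
{
    struct vivt_line line = { .valid = true, .asid = 1, .vtag = 0xabc };
    printf("same process:  %d\n", lookup_hits(&line, 1, 0xabc));  /* 1: hit  */
    printf("other process: %d\n", lookup_hits(&line, 2, 0xabc));  /* 0: miss */
    return 0;
}
```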
Presumably some designs use phys/phys L1, but I don't know any specific examples. The virt-index / phys-tag trick is pretty common in high-performance CPUs, because an L1 with enough ways to make it possible is just a good idea anyway.
Note that only L1 ever uses virt addresses. Big L2 and L3 caches are always phys/phys.
Other links:
http://www.realworldtech.com/forum/?threadid=76592&curpostid=76600 This whole thread goes into a lot of detail and questions about caches. David Kanter's posts tend to explain things in a readable way. I haven't read the whole thread. RWT forums are now searchable, so if you google for more details, you're likely to see hits from years of forum threads there.
Paul Clayton explains why phys-idx/virt-tag (PIVT) is such a bad idea that nobody would ever build one: it has the disadvantages of virtually-addressed caches without any of the advantages. (Wikipedia says the MIPS R6000 is the only known implementation, and gives the extremely esoteric reason: even a TLB would have been too large to implement in emitter-coupled logic, so they implemented a "TLB slice" that only translates enough bits for a physical index. Given that limitation, PIPT and VIPT were not options, and they decided to go PIVT instead of VIVT.)
Another very-detailed answer from Paul Clayton about caches.
Upvotes: 8