ishan3243

Reputation: 1928

Off Chip Cache Coherence and L2 cache partitioning in multicores (a programmer's view)

Well, I recently studied that, in order to save chip area, multicore processors don't have cache coherence hardware at the L1 level. Instead, the L2 cache is partitioned (number of partitions = number of hyperthreads, or whatever) to enforce off-chip cache coherence. At least, this is what I interpreted from the lecture. Is this correct?

If yes, then I am unable to visualize how this is even possible. How can you ignore coherence at the L1 level? If my interpretation is incorrect, then please shed some light on off-chip cache coherence and on why the L2 is partitioned.

Thanks!

Upvotes: 0

Views: 357

Answers (1)

user2467198


The lecture was probably indicating that the L1 cache in a multicore processor is not generally snooped to maintain coherence. Instead, a higher level of the cache hierarchy filters coherence traffic. With a fully inclusive (in tags only, or in tags and data) level of cache, extra bits can provide a local coherence directory--e.g., a bit vector over all cores or larger nodes indicating whether each node has the cache block. (This directory may be used as a filter rather than as exact tracking, e.g., to avoid buffering on lower-level cache evictions.) Other forms of filtering are also possible. The primary requirement is that all cases where the data is present in a lower-level cache are detected; a modest fraction of false positives would only modestly increase the amount of snoop traffic going to the lower-level caches.
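To make the bit-vector idea concrete, here is a minimal sketch (hypothetical, not any real processor's design) of a per-block directory entry kept alongside an inclusive cache level. Only cores whose bit is set need to be snooped, and a false positive merely causes a harmless extra snoop:

```python
class DirectoryEntry:
    """Per-block coherence directory entry: one bit per core."""

    def __init__(self):
        self.sharers = 0  # bit i set => core i may hold the block in its L1

    def add_sharer(self, core):
        self.sharers |= 1 << core

    def remove_sharer(self, core):
        self.sharers &= ~(1 << core)

    def cores_to_snoop(self, requester, num_cores):
        # Probe only cores whose bit is set (other than the requester).
        # False positives only add snoop traffic; they never miss a copy.
        return [c for c in range(num_cores)
                if c != requester and self.sharers & (1 << c)]

# Cores 0 and 2 cache a block; a write by core 1 snoops only those two
# L1s instead of broadcasting to all of them.
entry = DirectoryEntry()
entry.add_sharer(0)
entry.add_sharer(2)
print(entry.cores_to_snoop(1, 4))  # [0, 2]
```

With four cores this saves one snoop per miss; with dozens of cores the filtering is what keeps coherence traffic from scaling with the core count.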

Without such a filter, every miss on another core/node would have to probe all the other L1 caches. In addition to consuming more interconnect bandwidth, this extra tag-probing requirement would typically be handled by replicating the L1 tags, because L1 caches are highly optimized for latency and access bandwidth (making it more desirable to avoid unnecessary interference from coherence probes).

In a common multicore processor with on-chip L3, L2 caches are "private" to a node of one or a small number of cores. (Private in this context means that allocations are driven by the cores within the node. This L2 capacity is not used by other nodes.) Such a private L2 filters accesses from reaching the shared L3 on a hit (as long as it does not require an update to exclusive/modified status). By sharing L2 cache among only a small number (often one) of cores, access latency is kept lower both by more direct connection to the cores and by requiring a lower capacity. (Sharing L2 among two or even four cores can reduce the number of nodes in the higher level network and balance utilization of L2 capacity.)
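The filtering role of a private L2 can be sketched with a toy lookup model (hypothetical; caches are modeled as simple sets of block addresses, with no eviction or coherence state): an access only generates traffic at the shared L3 when both the core's L1 and its private L2 miss.

```python
def lookup(addr, l1, l2, l3, stats):
    """Walk the hierarchy; fill lower levels on the way back."""
    if addr in l1:
        stats['l1_hits'] += 1
    elif addr in l2:
        stats['l2_hits'] += 1          # private-L2 hit: L3 never sees it
        l1.add(addr)
    elif addr in l3:
        stats['l3_hits'] += 1          # only L1+L2 misses reach the L3
        l2.add(addr); l1.add(addr)
    else:
        stats['mem'] += 1              # miss everywhere: go to memory
        l3.add(addr); l2.add(addr); l1.add(addr)

stats = {'l1_hits': 0, 'l2_hits': 0, 'l3_hits': 0, 'mem': 0}
l1, l2, l3 = set(), {0x40}, {0x80}
for addr in (0x40, 0x40, 0x80, 0xC0):
    lookup(addr, l1, l2, l3, stats)
print(stats)  # {'l1_hits': 1, 'l2_hits': 1, 'l3_hits': 1, 'mem': 1}
```

Note that the second access to `0x40` hits in L1 because the first access allocated it there; only the last two accesses produce any traffic beyond the private levels.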

The last (on-chip) level of cache (LLC) is often partitioned. Attaching a slice to each lower level node allows that slice to have lower latency for communication with that node. Cache blocks that are accessed by that node can be preferentially placed in that slice or in a nearby (by network topology) slice to allow lower latency (and potentially higher bandwidth) local access. (This is a Non-Uniform Cache Architecture optimization. Because blocks are not tied to a specific slice based on address or accessing node it is possible to migrate and even replicate blocks.)
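The NUCA placement preference can be sketched as picking the topologically nearest slice for an accessing node (a hypothetical model; a 1-D ring of slices with hop-count distance stands in for the real on-chip network):

```python
def nearest_slice(node, slices, distance):
    """Prefer the slice with the lowest network distance to the node."""
    return min(slices, key=lambda s: distance(node, s))

def ring_distance(a, b, n=4):
    # Hop count between positions a and b on a ring of n slices.
    d = abs(a - b)
    return min(d, n - d)

# Node 1's blocks are preferentially placed in its own (co-located) slice.
print(nearest_slice(1, range(4), ring_distance))  # 1
```

In a real NUCA design the choice also weighs slice occupancy, and blocks can later migrate (or be replicated) as the access pattern shifts, which is exactly what decoupling blocks from a fixed slice makes possible.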

Alternately, allocation to the LLC slices can be strictly based on address, possibly associating each LLC slice with a memory controller. This requires only one slice to be probed to determine a hit or miss and fits with the use of a crossbar interconnect between lower level nodes and the LLC slices. This arrangement has the disadvantages that the memory controller-LLC connection is less latency critical and that utilization is tied to balanced demand based on address. However, it can provide faster determination of an L3 hit/miss and may (if slices are associated with memory controllers) reduce overhead for prefetching from memory and eager writeback. (When misses are more common and/or blocks are frequently shared by multiple nodes, address-based allocation becomes more attractive because a miss only needs to probe one slice [in addition to possibly supporting more aggressive prefetching and more likely being memory bandwidth limited rather than LLC capacity limited--so imbalance in memory controller use would be bad anyway] and a shared block can be more directly accessed by all of the nodes that use it without replication.)
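Strict address-based slicing can be illustrated with a sketch (hypothetical parameters: 64-byte blocks, 8 slices, simple modulo selection; real designs often hash more address bits to spread conflict-prone strides):

```python
BLOCK_BITS = 6    # 64-byte cache blocks
NUM_SLICES = 8    # assumed power of two for cheap bit selection

def slice_for(addr):
    """The slice is a pure function of the address, so a miss or hit
    is determined by probing exactly one slice."""
    block = addr >> BLOCK_BITS
    return block % NUM_SLICES

# Consecutive blocks interleave across slices, so capacity balancing
# depends only on the address stream, not on which node is accessing.
print([slice_for(a) for a in range(0, 64 * 8, 64)])  # [0, 1, 2, 3, 4, 5, 6, 7]
```

Because `slice_for` is the same function at every node, a shared block lives in exactly one slice and every sharer finds it there without replication, at the cost of losing the NUCA option of placing it near its heaviest user.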

(Obviously combinations of these two allocation methods can be used. Even just biasing allocation based on address could reduce demand on interconnect bandwidth.)

Partitioning tends to reduce latency (especially with a NUCA arrangement) and design complexity as well as facilitate design reuse with different numbers of partitions (and perhaps defect isolation so that a chip with a manufacturing defect can more easily be used as a product with fewer partitions).

Upvotes: 3
