Reputation: 23
i found in linux, it shows my cpu's cache line size is 64 byte, and i realized 16/32/128 byte is existing, but most cpu are designed to 64 byte cache line size now. why not bigger or smaller?
Upvotes: 2
Views: 6478
Reputation: 126508
It's a trade-off. Wider caches are more efficient (in terms of area/power for a given cache size), but result in more memory traffic for random (non-sequential/strided) access, and more false sharing contention between parallel caches.
if you have a memory access pattern that only needs a few bytes from each cache line (eg, iterating along a linked list that is scattered widely across memory), each access will need to pull an entire line into the cache. So doubling the line size will double the memory traffic.
if different CPUs, each with its own cache, are accessing memory on the same cache line, that line will have to "bounce" back and forth between the caches. Avoiding this means putting more padding between objects.
In both cases, these problems can be avoided by tuning the software to want memory in chunks that are multiples of the cache line size. The bigger the cache line size, the more work that is.
Upvotes: 8
Reputation:
As Chris Dodd's answer points out, the sizing of cache lines involves trade-offs.
Larger cache lines reduce the number of tag bits per data byte, provide prefetching, and facilitate higher bandwidth (particularly at the memory and the L1 interfaces) at the cost of excessive prefetch (wasting bandwidth and cache capacity), false sharing, higher miss latency (especially without critical word first/early restart), and higher conflict misses (for smaller caches, with fewer sets the probability that more accesses than the associativity will map to a particular set increases). (Larger cache lines can also provide greater performance predictability by guaranteeing a cache hit within a larger address range and number of bytes.)
Modern systems would not noticeably benefit from such prefetching; configurable static prefetcher logic would provide the same behavior and dynamic prefetching can exploit variable resource availability (e.g., cache capacity and memory channel occupancy) and utility as well as provide more flexible prefetching (such as non-unit stride).
Tag overhead is not as significant a concern in terms of area for modern caches using SRAM for data as well as tags. (IBM's Power and zArchitecture implementations use eDRAM for outer cache data storage and SRAM for tags, which more than doubles the area cost of tags relative to data.) However, access latency and access energy are effected by the size of the tag arrays. For L1 caches, way prediction is more effective with larger cache lines both because there are fewer cache lines for a given cache capacity and because spatial locality tends to apply even beyond reasonable cache line sizes; only having to check one set of tags for a wider or larger number of accesses reduces the cost of higher bandwidth (this is most noticeable in GPUs which exploit spatial locality and sacrifice latency for bandwidth). For outer cache levels, phased tag-data access is often used (tags are checked before data access begins, saving energy — especially given higher associativity and miss rates); smaller tag arrays for a given capacity both reduce access energy and latency (especially for misses — 50% hit rates are not unheardof). (Note that one can use partial tags to provide early miss detection in the common case where a miss has no matching partial tags. Other filtering mechanisms are also possible.)
False sharing can be countered by using sectored caches, where more than one validity (or coherence state) entry is provided for each address portion of the tag. This provides an intermediate design point between larger cache lines with more frequent false sharing and smaller cache lines with higher tag overhead. Such also inherently supports reducing cache line fill delay. For traditional layouts, this has the substantial effective capacity cost of large cache lines when false sharing or less spatial locality is more common. For designs using indirection, such as proposed Non-Uniform Cache Architectures and V-Way caches, the capacity utilization issue can be reduced by allocating data storage at a finer granularity at the cost of more indirection pointer storage.
Larger cache lines provide three bandwidth benefits. The command overhead is less (address and action information is nearly constant — address is one bit smaller for each doubling in size) so the bandwidth overhead per data byte is lower; this is more significant for coherence traffic where many messages carry only metadata. (Obviously with more numerous coherence nodes false sharing can be more problematic acting against this advantage.) Other per-request overheads also do not scale with request size (e.g., DRAM row activation with random-access, close-on-completion row management). ECC (or check codes with retransmission) also have less per payload byte overhead with larger payloads (this can be used to store extra metadata while using commodity width memory modules).
A larger cache line also facilitates wider memory interfaces when burst length is fixed. Increasing DRAM burst length facilitates higher bandwidth; DDR5 moved to a burst length of 16, pushing DIMMs into using two 32-bit wide channels to be compatible with x86's de facto standardization on 64-byte cache lines. While this chance can be viewed positively as increasing available memory level parallelism (MLP)— doubling the number of channels and reducing DRAM bank conflicts — MLP is more important when relative latency of memory is greater (large on-chip caches and faster processing), thread-level parallelism is available (multicore and multithreading), and out-of-order execution (and multithreading) expose more memory accesses to hide latency. Multicore (when used with significant data memory sharing as opposed to multiprogramming or large chunk/stream communication such as pipeline-style mulithreading) also increases the importance of false sharing, further reducing the benefits of larger cache lines (beyond MLP benefits of narrower channels). With lower (on-chip) communication latency and near-unavoidability of multicore processors, multithreaded programming becomes more attractive.
For L1 caches the microarchitecture (and somewhat the ISA) can be influence cache line sizes. Higher frequency designs favor smaller capacity L1 caches both for latency and access energy, especially with less latency tolerance from out-of-order execution or skewed pipelines (where the execution pipeline phase is one or more stages delayed from the address generation phase).
The relative sizes of the various tradeoffs also depends on the workload and software design. Larger cache capacity caches targeting workloads that benefit from such reduce the excessive prefetch and conflict disadvantages of larger cache lines; higher associativity reduces the conflict disadvantage, but workloads that are more likely to have conflicts are less likely (in general) to benefit from spatial locality (and the conflict disadvantage is typically less important for outer cache levels). Pointer-chasing workloads tend to favor lower latency and thus lower capacity, favoring smaller cache lines (at least in L1).
Software design is a significant factor. Avoiding false sharing tends to increase padding as cache line size increases, discouraging larger cache lines. Once a cache line size assumption is established in a software community (which is somewhat segregated according to ISA, OS, and hardware/system vendor) the effect of legacy code and legacy conceptualizations constrains cache line size.
Speculation: x86's orientation toward generic software and personal computer uses (cost and workload characteristics biasing toward smaller caches and workload perhaps generally having lower spatial locality) probably biased the choice toward a smaller cache line than ISA/hardware vendors targeting workstation and server workloads with a higher expectation of software development effort. x86 has standardized on 64-byte cache lines, IBM POWER9 uses 128-byte cache blocks (divided into four sectors for L1 caches) and IBM z15 uses 256-byte cache blocks.
(Latency vs. hit rate, access energy, and other tradeoffs as well as software and programmer legacy seem to lead to less strict standardization on 32KiB L1 cache capacity. The performance impact for a smaller or larger cache can be less significant than for false sharing, so the software constraints are less significant than for cache line size.)
Upvotes: 7