Shangchih Huang
Shangchih Huang

Reputation: 317

What is the advantage of caching an entire line instead of a single byte or word at a time?

To use cache memory, main memory is divided into cache lines, typically 32 or 64 bytes long. An entire cache line is cached at once. What is the advantage of caching an entire line instead of a single byte or word at a time?

Upvotes: 3

Views: 3140

Answers (3)

Ahx
Ahx

Reputation: 8005

  • The main advantage of caching an entire line is the probability of the next cache-hit is increased.

    • From Tanenbaum's "Modern Operating Systems" book:

    • Cache-hit: When the program needs to read a memory word, the cache hardware checks to see if the line needed is in the cache.


  • If we don't have a cache-hit then cache-miss will occur. A memory request is sent to the main memory.

  • As a result, more time will be spent to complete the process, since searching inside the memory is costly.

  • We can tell that, caching an entire line will increase the probability of completing the process in two-cycles.

Upvotes: 0

user2467198
user2467198

Reputation:

Having cache block size equal to the smallest addressable size would mean, if a larger size access is supported, multiple tags would have to be checked for such larger accesses. While parallel tag checking is often used for set associative caches, a four-fold increase (8-bit compared to 32-bit) in the number of tags to check would increase access latency and greatly increase energy cost. In addition, such introduces the possibility of partial hits for larger accesses, increasing the complexity of sending the data to a dependent operation or internal storage. While data can be speculatively sent by assuming a full hit (so latency need not be hurt by the possibility of partial hits), the complexity budget is better not spent on supporting partial hits.

32-bit cache blocks, when the largest access size is 32 bits, would avoid the above-mentioned issues, but would use a significant fraction of storage for tags. E.g., a 16KiB direct-mapped cache in a 32-bit address space would use 18 bits for the address portion of the tag; even without additional metadata such as coherence state, tags would use 36% of the storage. (Additional metadata might be avoided by having a 16KiB region of the address space be non-cacheable; a tag matching this address region would be interpreted as "invalid".)

Besides the storage overhead, having more tag data tends to increase latency (smaller tag storage facilitates earlier way selection) and access energy. In addition, having a smaller number of blocks for a cache of a given size makes way prediction and memoization easier, these are used to reduce latency and/or access energy.

(The storage overhead can be a significant factor when it allows tags to be on chip while data is too large to fit on chip. If data uses a denser storage type than tags — e.g., data in DRAM and tags in SRAM with a four-fold difference in storage density —, lower tag overhead becomes more significant.)

If caches only exploited temporal locality (the reuse of a memory location within a "short" period of time), this would typically be the most attractive block size. However, spatial locality of access (accesses to locations near an earlier access often being close in time) is common. Taken control flow instructions are typically less than a sixth of all instructions and many branches and jumps are short (so the branch/jump target is somewhat likely to be within the same cache block as the branch/jump instruction if each cache block holds four or more instructions). Stack frames are local to a function (concentrating the timing of accesses, especially for leaf functions, which are common). Array accesses often use unit stride or very small strides. Members of a structure/object tend to be accessed nearby in time (conceptually related data tends to be related in action/purpose and so accessed nearer in time). Even some memory allocation patterns bias access toward spatial locality; related structures/objects are often allocated nearby in time — if the preferred free memory is not fragmented (which would happen if spatially local allocations are freed nearby in time, if little memory has been freed, or if the allocator is clever in reducing fragmentation, then such allocations are more likely to be spatially local.

With multiple caches, coherence overhead also tends to be reduced with larger cache blocks (under the assumption spatial locality). False sharing increases coherence overhead (similar to lack of spatial locality increasing capacity and conflict misses).

In this sense, larger cache blocks can be viewed as a simple form of prefetching (even with respect to coherence). Prefetching trades bandwidth and cache capacity for a reduction in latency via cache hits (as well as from increasing the useful queue size and scheduling flexibility). One could gain the same benefit by always prefetching a chunk of memory into multiple small cache blocks, but the capacity benefit of finer-grained eviction would be modest because spatial locality of use is common. In addition, to avoid prefetching data that is already in the cache, the tags for the other blocks would have to be probed to check for hits.

With simple modulo-power-of-two indexing and modest associativity, two spatially nearby blocks are more likely to conflict and evict earlier another blocks with spatial locality (index A and index B will have the same spatial locality relationship for all addresses mapping to indexes within a larger address range). With LRU-oriented replacement, accesses within a larger cache block reduce the chance of a too-early eviction when spatial locality is common at the cost of some capacity and conflict misses.

(For a direct-mapped cache, there is no difference between always prefetching a multi-block aligned chunk and using a larger cache block, so paying the extra tag overhead would be pointless.)

Prefetching into a smaller buffer would avoid cache pollution from used data, increasing the benefit of smaller block size, but such also reduces the temporal scope of the spatial locality. A four-entry prefetch buffer would only support spatial locality within four cache misses; this would catch most stream-like accesses (rarely will more than four new "streams" be active at the same time) and many other cases of spatial locality but some spatial locality is over a larger span of time.

Mandatory prefetching (whether from larger cache blocks or a more flexible mechanism) provides significant bandwidth advantages. First, the address and request type overhead is spread over a larger amount of data. 32 bits of address and request type overhead per 32 bit access uses 50% of the bandwidth for non-data but less than 12% when 256 bits of data are transferred.

Second, the memory controller processing and scheduling overhead can be more easily averaged over more transferred data.

Finally, DRAM chips can provide greater bandwidth by exploiting internal prefetch. Even in the days of Fast Page Mode DRAM, accesses within the same DRAM page were faster and higher bandwidth (less page precharge and activation overhead); while non-mandatory prefetch could exploit such and be more general, the control and communication overheads would be larger. Modern DRAMs have minimum burst lengths (and burst chop merely drops part of the DRAM-chip-internal prefetch — the internal access energy and array occupation are nor reduced).

The ideal cache block size depends on workload ('natural' algorithm choices and legacy optimization assumptions, data set sizes and complexity, etc.), cache sizes and associativity (larger and more associative caches encourage larger blocks), available bandwidth, use of in-cache data compression (which tends to encourage larger blocks), cache block sectoring (where validity/coherence state is tracked at finer granularity than the address), and other factors.

Upvotes: 1

itsfreeandtakesminute
itsfreeandtakesminute

Reputation: 41

This is done to exploit the principle of locality; spatial locality to be precise. This principle states that the data bytes which lie close together in memory are likely to be referenced together in a program. This is immediately apparent when accessing large arrays in loops. However, this is not always true (e.g. pointer based memory access) and hence it is not advisable to fetch data from memory at more than the granularity of cache lines (in case the program does not have locality of reference) since cache is a very limited and important resource.

Upvotes: 2

Related Questions