Reputation: 30615
If I have this class:
class MyClass{
short a;
short b;
short c;
};
and I have this code performing calculations on the above:
std::vector<MyClass> vec;
//
for(auto x : vec){
sum = vec.a * (3 + vec.b) / vec.c;
}
I understand the CPU only loads the very data it needs from the L1 cache, but when the L1 cache retrieves data from the L2 cache it loads a whole "cache line" (which could include a few bytes of data it doesn't need).
How much data does the L2 cache load from the L3 cache, and the L3 cache load from main memory? Is it defined in terms of pages and if so, how would this answer differ according to different L2/L3 cache sizes?
Upvotes: 3
Views: 2009
Reputation: 3073
It depends somewhat on the ISA and microarchitecture of your platform. Recent x86-64 based microarchitectures use 64 byte lines in all levels of the cache hierarchy.
Typically signed shorts will require two bytes each meaning that MyClass will need 6 bytes in addition the class overhead. If your C++ implementation stores the vector<> contiguously like an array you should get about 10 MyClass objects per 64-byte lines. Provided the vector<> is the right length, you won't load much garbage.
It's wise to note that since you're accessing the elements in a very predictable pattern the hardware prefetcher should kick in and fetch a reasonable amount of data it expects to use in the future. This could potentially bring more than you need into various levels of the cache hierarchy. It will vary from chip to chip.
Upvotes: 4
Reputation:
L2 and L3 caches also have cache lines that are smaller than a virtual memory system page. The size of L2 and L3 cache lines is greater than or equal to the L1 cache line size, not uncommonly being twice that of the L1 cache line size.
For recent x86 processors, all caches use the same 64-byte cache line size. (Early Pentium 4 processors had 64-byte L1 cache lines and 128-byte L2 cache lines.)
IBM's POWER7 uses 128-byte cache blocks in L1, L2, and L3. (However, POWER4 used 128-byte blocks in L1 and L2, but sectored 512-byte blocks in the off-chip L3. Sectored blocks provide a valid bit for subblocks. For L2 and L3 caches, sectoring allows a single coherence size to be used throughout the system.)
Using a larger cache line size in last level cache reduces tag overhead and facilitates long burst accesses between the processor and main memory (longer bursts can provide more bandwidth and facilitate more extensive error correction and DRAM chip redundancy), while allowing other levels of cache and cache coherence to use smaller chunks which reduces bandwidth use and capacity waste. (Large last level cache blocks also provide a prefetching effect whose cache polluting issues are less severe because of the relatively high capacity of last level caches. However, hardware prefetching can accomplish the same effect with less waste of cache capacity.) With a smaller cache (e.g., typical L1 cache), evictions happen more frequently so the time span in which spatial locality can be exploited is smaller (i.e., it is more likely that only data in one smaller chunk will be used before the cache line is evicted). A larger cache line also reduces the number of blocks available, in some sense reducing the capacity of the cache; this capacity reduction is particularly problematic for a small cache.
Upvotes: 7