Reputation: 59

How to decide row / column strides for a loop over a matrix get these cache hit rates?

Given CPU with:
L1 cache: 4-ways, block size = 32 bytes , cache size = 64KB , LRU (Cache replacement policy).
L2 cache: 2-ways, block size = 32 bytes , cache size = 512KB , LRU (Cache replacement policy).

Array: int32_t A[2048][128]; so element size is 4 bytes.

for(int k=0, k<50, k++)
   for(int i=0; i<2048; i+=Di)
       for(int j=0; j<128; j+=Dj)  
           S += A[i][j]^k;

For which values of the constants Di and Dj can I get L1 hit rate = 75% and L2 hit rate bigger than 90%?

I tried to do the following:
I know that for L1 cache the address is: offset = 5 bits, set = 9 bits.
And hence, after 2^14 B = 2^12 elements we arrive to same set (but in other way).
and in L2 cache: offset = 5 bits, set = 13 bits.
And hence, after 2^18 B = 2^16 elements we arrive to same set (but in other way). And because that we want L1 hit rate = 3/4 . So we will take Dj=2. And so we will get for any 4 elements: miss, hit, hit, hit. (any block of L1 can contain 8 elements).

But I don't know how to decide value for Di.

Upvotes: 1

Answers (1)

jordanvrtanoski

Reputation: 5557

Theoretical background

To decide the rows/columns (strides) of memory that we want to read in order to get the desired cache hit/miss ratio, we need to understand how the L1 and L2 cache work and what is the meaning of the input parameters we are given for this problem. In the question we already have the final calculation, but I will quickly explain how we had arrived to this numbers:

The cache is organised in blocks and sets. Each set can contain n number of blocks, so we call this "n-way set associative" cache. In the example we have 4-way and 2-way. The address of a byte in the block is constructed as follows:

Cache address = INDEX[0:N] + OFFSET[0:M]

where INDEX is representing the index of the set and has N bits and offset is determined by the size of the block and has M bits.

Starting from L1 cache, we are saying that this cache is "4-way". This means that the cache is organised in to sets of 32 bytes and each set can have a mapping to 4 different L2 blocks associated with it. Since we have 64KB for the L1, and each block is 32bit, we will have 2,048 blocks (64K/32), or 512 blocks per set (2,048/4). The address of a set is determined by the index, so our L1 index has 9 bits (N = 9). The byte in the block is determined by the offset, and since we have 32 bytes (2^5) we will have 5 bits (M=5) for the offset.

L2 cache is "2-way" with 512KB with 32bit blocks, so number of blocks will be 16384 (512KB/32), or 8,192 sets (16,384/2) therefore N=13. Since the L2 block is 32 bytes the offset will be 5 bits (M=5).

Solution

For a start we will assume that the cache is dedicated for the A array, meaning the variable k, i, j and S are stored in registers, so read/write for them will not cause memory operations.

When the instruction S += A[i][j]^k; is executed, the processor will need to fetch the element from the A on i row and j column. Assuming that we just started the execution, the cache is empty, therefore when we try to access A[i][j], the CPU will try to load the value from L1. Since L1 is empty, we will have "miss" for L1 and which will cause the CPU to try to load the data from L2. L2 is also empty, so we have "miss" for L2 as well and the CPU will move the whole block (32 bytes) from the memory to L2 and then to L1.

Since "A" is 32 bit (4 bytes), the block contains 8 elements. If the subsequent reads are all from the same block, the block will already be in the L1 cache, so we will have "hit". This will give us 7/8 hit ratio (87.5%) which is more than what we are looking for, so to lower the ration and to make it 3/4 (75%) we need to assure we read each second element from A (assuming A is aligned to the block size boundary). To do so, we need to select Dj to be 2.

To assure we keep the 75%, we need to make sure that the L1 will be overwritten, otherwise on the next cycle we will find the values in the cache which will cause our hit ratio to be 100% on the subsequent cycles. This will not be a problem as long as we assure that we are reading more than 2,048 blocks in the cycle, which means that we need to read more than 8,192 (16,384/2) elements. And to assure this we will use Di.

So, from what we know at this moment, we have:

   (max(i)/ Di) * (max(j)/Dj) * size(A[0][0]) / block size  > 2048   :>
   ((2048 * 128/2 * 4) / 32) / Di > 2048                             :> 
   16384 / Di > 2048                                                 :>
   Di < 16384/2048                                                   :>
   Di < 8                                                            :>
   1 <= Di < 8

This means that Di needs to be a number between 1 and 7 to assure L1 hit ratio will remain 75%.

In addition, the Di number should be selected in a way to that will assure the L2 hit ratio is more than 90%.

As we mentioned that when L1 is "miss", the CPU will try to fetch the data from L2, and as we saw before, this will happen only when L1 is at the first element of the block. Since we have the requirement for the L2 hit ratio to be > 90%, let's try to make it as close as possible to 100%. The L2 can fit 131,072 elements, so let's try to fill the cache with elements on the first run and then on the subsequent 49 runs (the loop k) we will have cache hit for each element. This will give us around 98% hit ratio (which is more than 90%). Considering that we will read 128/Dj elements from the cache, and we want to fill all 13,1072 elements, we have the following constraint:

    max(i) / Di = 131072 / max(j) / Dj  :>
    max(i) / Di = 131072 / 128 / 2      :>
    2048 / Di = 2048                    :>
    Di = 1

Therefore, if we select Di to be 1 and Dj to be 2, we will have 75% cache hit on L1 and we will have 98% hit on L2 (which is more than 90% required).

In fact, thanks to the outer loop (the loop k), any number for Di that is less than 7 will give us the 98% of cache hit ratio since it will assure that we read more than 2,048 elements and less or equal to 131,072 elements.

Corrections for a real CPU

Important thing to understand is that, while we are fetching data from the array "A" from the memory segment containing the data, we also need to fetch the instructions from the code segment in memory. Considering this, in a real processor, the cache hit ratio will be lower for the amount of data that will be fetched for the instructions.

To address this, we will assume that one set of blocks (one "way") will be reserved for the code. With this assumption our hit ratio can be higher than the desired ratio, but we will assure that the ratio is not lower than the desired one.

With this assumption we will have for L1 instead of 2,048 blocks and 512 sets, we will have 1,536 organised in 512 sets (3-way cache) that can be used for "A", and for L2 instead of 16,384 blocks organised in 8,192 sets, we will have 8,192 blocks organised in 8,192 sets (or directly mapped cache) that we can use for "A". After the correction we can use the same approach as above to find values for Di and Dj.

After the correction the value of Dj will not be changed, it will still be 2, however the Di will be between 1 and 5.