Dzung Nguyen

Reputation: 3942

Cache performance

I'm going through CSAPP book, and I'm not sure the difference between two following loops in terms of cache performance:

Here the cache is 2048 bytes, direct-mapped (one line per set), with 16-byte blocks, and we define the following struct:

struct algae_position {
    int x;
    int y;
};
struct algae_position grid[16][16];

Code 1:

for (i = 0; i < 16; i++) {
    for (j = 0; j < 16; j++) {
        total_x += grid[i][j].x;
    }
}
for (i = 0; i < 16; i++) {
    for (j = 0; j < 16; j++) {
        total_y += grid[i][j].y;
    }
}

and code 2:

for (i = 0; i < 16; i++) {
    for (j = 0; j < 16; j++) {
        total_x += grid[i][j].x;
        total_y += grid[i][j].y;
    }
}

My thoughts: we always have the pattern miss, hit, miss, hit, since a line can hold only two grid elements. Therefore we always have a 50% miss rate.

However according to the book, code 2 will have a miss rate of 25% because the cache can hold the whole grid array. How should I understand this problem?

Upvotes: 0

Views: 748

Answers (1)

Nominal Animal

Reputation: 39436

Assume 4-byte ints, so that each struct algae_position is eight bytes long, and each cache line can contain two structures. Also assume the array is aligned on a cache line boundary (starts at the beginning of a cache line).

In the first code, in the first loop, every access to grid[i][j].x with even j (0, 2, 4, ...) is a miss, while each access following it (odd j: 1, 3, 5, ...) is a hit. The first loop therefore has a 50% cache hit rate. However, because the entire array fits in the cache, the accesses in the second loop never miss: the second loop has a 100% cache hit rate. Thus, the first code has an overall 75% cache hit rate.

(In practice, some of the data ends up being evicted from the cache early, at least some of the time, so this code has a real-world cache hit rate between 50% and 75%.)

In the second code, the very first access, grid[0][0].x, is a miss. However, that miss brings the entire cache line into the cache, so grid[0][0].y, grid[0][1].x, and grid[0][1].y are all hits. The next access is again a miss, so this pattern repeats: every group of four accesses is an initial miss followed by three hits. Thus, the second code also has an overall 75% cache hit rate. The cache size is not relevant here.

In practice, the second code is better (measurably so for larger arrays). Not only does it depend on just one cache line at a time (plus prefetching, which current CPUs are pretty good at), it also computes the three other values during the interval that would otherwise be wasted waiting for the cache line to load. The first code not only wastes that time, it also relies on the CPU not interrupting the function, and on all the data staying cached for the entire loop. So the first code not only "wastes" cache (relies on lots of cache being available), it also wastes CPU cycles by doing no useful work while waiting for the next cache line to load.

These latency issues are something I wish more programmers would focus on.

Upvotes: 1
