Reputation: 339
I am trying to learn how the cache works. In a sample experiment program that iterates through an array and does some operations on the elements, I see that the execution time suddenly increases a lot once I grow the array beyond a particular size. Can anyone explain in simple terms how cache size and array size affect the performance of mathematical operations on an array?
Upvotes: 4
Views: 3635
Reputation: 5110
If the cache cannot hold the whole array, any reference to an element that is not cached results in a cache miss. The way you access the array elements also makes a difference, because on every miss the processor brings a whole block of data into the cache, anticipating that nearby data will be needed soon, which helps it avoid future cache misses.
Example:
If you operate on elements at consecutive locations, performance improves, because on the first cache miss the processor fetches a whole block of memory (one cache line), and subsequent accesses to that block are hits.
For example, take matrix multiplication, which we usually write the following way.
Assume: the matrices are too large to fit in the cache.
for (i = 0; i < N; i = i + 1)
    for (j = 0; j < N; j = j + 1)
        A[i*N + j] = (double) random() / SOME_NUMBER;

for (i = 0; i < N; i = i + 1)
    for (j = 0; j < N; j = j + 1)
        B[i*N + j] = (double) random() / SOME_NUMBER;

for (i = 0; i < N; i = i + 1)
    for (j = 0; j < N; j = j + 1)
        for (k = 0; k < N; k = k + 1)
            C[i*N + j] = C[i*N + j] + A[i*N + k]*B[k*N + j];
Here, while computing each row of the result, we access the second matrix column by column. In B the first column is stored at
B[0], B[N], B[2N], ...
and so on, which are not consecutive memory locations, so there will be a lot of cache misses. If we can reshape the solution so that we deal with consecutive memory locations, we gain performance. Instead of storing the second matrix in 'Row Major Form', we can store it in 'Column Major Form', i.e. in transposed form. Now all the elements of a column are at consecutive locations.
A
is accessed row-wise and B
column-wise, but because B is stored transposed, both traversals now read consecutive memory locations in the inner loop.
So when the processor takes the first miss, it fetches a whole block of memory, and the following accesses hit in the cache. Here we have exploited 'Spatial Locality'. (A further optimization along the same lines is loop tiling, also called 'Cache Blocking', which reuses each fetched block across several rows.)
Now if we change the code as follows, there will be a significant improvement in performance.
Store B in transposed form:

    B[j*N + i] = (double) random() / SOME_NUMBER;

You'll also have to access the transposed array in that order in the inner loop:

    C[i*N + j] = C[i*N + j] + A[i*N + k]*B[j*N + k];
Upvotes: 2