Reputation: 865
I am writing a simple C program to determine the associativity (number of ways) of my CPU's cache.
I know my cache size is 32 KB (L1) and the line size is 64 bytes. From that I know there are 500 lines.
My approach is to access the first 8192 integer elements (32 KB) and see where the access takes longer; if it takes longer at every x-th access, then x is the associativity (number of ways).
However, the results I get don't show any pattern.
Here is my C code:
void run_associativity_test() {
    int j = 1;
    // 8192 * 4 bytes (int) is 32 KB
    while (j <= 8192 * 2) {
        get_element_access_time(j);
        j = j + 1;
    }
}
double get_element_access_time(int index) {
    struct timespec start_t, end_t;
    double start, end, delta;

    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start_t);
    arr[index] += 1;
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end_t);

    start = 1000000000 * start_t.tv_sec + start_t.tv_nsec;
    end = 1000000000 * end_t.tv_sec + end_t.tv_nsec;
    delta = end - start;

    if (mode == 2 || mode == 3) {
        printf("%d, %lf\n", index, delta);
    }
    return delta;
}
Is my approach wrong? How should I do it?
Also, I found a paper here that explains how to measure the associativity, but I couldn't understand it very well. I would be grateful if someone could briefly explain the method the paper uses to measure the associativity.
Thanks!
Upvotes: 0
Views: 1159
Reputation: 23669
This might be more of a comment than an answer, but it's too big to post as a comment.
I know my cache size is 32 KB (L1) and the line size is 64 bytes. From that I know there are 500 lines.
The size of the cache is 2^15 bytes and a line is 2^6 bytes, so there are 2^15 / 2^6 = 2^9 = 512 cache lines, not 500.
while (j <= 8192 * 2) {
I thought the size of the array is 8192 ints, not (8192 * 2) + 1 ints.
get_element_access_time(j);
j = j + 1;
A cache line can hold 16 ints. Accessing the elements of the array sequentially would result in a miss ratio of at most 1/16, depending on the L1D prefetcher. It's difficult to estimate the number of ways of the L1D cache with that access pattern. I think the best way to do it is to repeatedly thrash the same cache set.
Let's forget about the L1D prefetcher for the moment. Also, let's only consider L1D caches that use bits 6-11 of the memory address (or a subset thereof) as the cache set index. For example, if the cache were 8-way associative, there would be 2^9 / 2^3 = 64 sets, which means all of bits 6-11 are used for the index.
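To make the indexing concrete, here is a tiny sketch of my own (assuming 64-byte lines and 64 sets, i.e. bits 6-11 as the set index); it only illustrates that addresses 4096 bytes apart fall into the same set:
#include <stdint.h>
#include <stdio.h>

/* Set index under the assumed geometry: bits 0-5 are the line offset,
 * bits 6-11 are the set index (64 sets). */
static unsigned set_index(uintptr_t addr) {
    return (addr >> 6) & 0x3F;
}

int main(void) {
    static char buf[1 << 15];
    uintptr_t base = (uintptr_t)buf;
    /* 64 sets * 64 bytes = 4096 bytes: every 4 KB we wrap around to the same set. */
    for (int i = 0; i < 4; i++)
        printf("offset %5d -> set %u\n", i * 4096, set_index(base + (uintptr_t)i * 4096));
    return 0;
}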
How do you check whether the cache is 8-way associative? By accessing 8 cache lines that all map to the same cache set, many times (a million or more). If the associativity of the cache is at least 8, the execution time should be better than if the associativity is less than 8. That's because in the former case there would be only 8 misses (one per cache line), but in the latter case there would be many misses, since not all 8 lines can reside in the L1D at the same time. To make your measurements as accurate as possible, you want to maximize the L1D miss penalty. One way to do that is to write to the cache lines instead of reading them; this forces the L1D to write back every evicted line, which will hopefully have a measurable impact on performance. Another way is to maximize the number of L2D misses.
It's relatively easy to write a program that exhibits such an access pattern. Once you know whether the associativity is smaller than 8 or not, you can close in on it further by testing other, smaller ranges of associativities in the same way. Note that you only need to write to one element in each cache line. It's also important to make sure that each write is flushed out of the core's write buffer; otherwise many writes might just be performed in the write buffer rather than in the cache. Essentially this can be done by using the volatile keyword (I think?) or store fences.
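Here is a minimal sketch of that thrashing idea (my own illustration, not from the paper): it writes, through a volatile pointer, to nlines cache lines that all map to the same set, a million-plus times, and prints the average time per access. The 4096-byte stride assumes 64-byte lines and 64 sets as above; the iteration count and timer are arbitrary choices:
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define SET_STRIDE 4096          /* 64 sets * 64-byte lines: same set every 4 KB */
#define ITERATIONS (1 << 20)

/* Write to `nlines` cache lines that all map to the same set, ITERATIONS
 * times, and return the elapsed time in nanoseconds. */
static double thrash_one_set(volatile char *buf, int nlines) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ITERATIONS; i++)
        for (int l = 0; l < nlines; l++)
            buf[(size_t)l * SET_STRIDE] += 1;   /* volatile store: must reach the cache */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
}

int main(void) {
    /* 4096-byte aligned buffer large enough for 32 same-set lines. */
    char *buf = aligned_alloc(4096, 32 * SET_STRIDE);
    if (!buf) return 1;
    for (int nlines = 1; nlines <= 32; nlines *= 2) {
        double ns = thrash_one_set(buf, nlines);
        printf("%2d lines in one set: %.2f ns per access\n",
               nlines, ns / ((double)ITERATIONS * nlines));
    }
    free(buf);
    return 0;
}
On an 8-way cache the per-access time should jump somewhere between 8 and 16 lines; from there you can narrow the range with non-power-of-two values of nlines.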
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start_t);
arr[index] += 1;
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end_t);
This doesn't make sense. The timer's resolution is not high enough to precisely measure the latency of a single memory write. You should measure the execution time of all the accesses together.
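For example, a minimal reworking of the question's measurement (my sketch; arr is declared here just to make it self-contained, and marked volatile so the stores are not optimized away) times the whole loop and reports an average:
#include <stdio.h>
#include <time.h>

#define N 8192
static volatile int arr[N];   /* same size as in the question */

int main(void) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &t0);
    for (int j = 0; j < N; j++)
        arr[j] += 1;              /* one timer around all the accesses */
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &t1);
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("total %.0f ns, %.2f ns per access on average\n", ns, ns / N);
    return 0;
}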
The L1D prefetcher may interfere with the measurements, potentially making the cache appear to have a higher associativity than it really is. Switch it off if possible.
If the L1D cache uses bits other than 6-11 to index the cache, virtual memory comes into play (bits above 11 are subject to virtual-to-physical translation), which would make it much more complicated to accurately estimate the associativity.
Upvotes: 2