Reputation: 14970
I have the following program which I with the help of someother on stackoverflow wrote to understand cachelines and CPU caches.I have the result of the calculation posted below.
1 450.0 440.0
2 420.0 230.0
4 400.0 110.0
8 390.0 60.0
16 380.0 30.0
32 320.0 10.0
64 180.0 10.0
128 60.0 0.0
256 40.0 10.0
512 10.0 0.0
1024 10.0 0.0
I have plotted a graph using gnuplot which is posted below.
I have the following questions.
is my timing calculation in milliseconds correct ? 440ms seems to be a lot of time?
From the graph cache_access_1 (redline) can we conclude that the size of cache line is 32 bits (and not 64-bits?)
Between the for loops in the code is it a good idea to clear the cache? If yes how do I do that programmatically?
As you can see I have some 0.0
values in the result above.?
What does this indicate? is the granularity of measurement too
coarse?
Kindly reply.
#include <stdio.h>
#include <sys/time.h>
#include <time.h>
#include <unistd.h>
#include <stdlib.h>
#define MAX_SIZE (512*1024*1024)
int main()
{
clock_t start, end;
double cpu_time;
int i = 0;
int k = 0;
int count = 0;
/*
* MAX_SIZE array is too big for stack.This is an unfortunate rough edge of the way the stack works.
* It lives in a fixed-size buffer, set by the program executable's configuration according to the
* operating system, but its actual size is seldom checked against the available space.
*/
/*int arr[MAX_SIZE];*/
int *arr = (int*)malloc(MAX_SIZE * sizeof(int));
/*cpu clock ticks count start*/
for(k = 0; k < 3; k++)
{
start = clock();
count = 0;
for (i = 0; i < MAX_SIZE; i++)
{
arr[i] += 3;
/*count++;*/
}
/*cpu clock ticks count stop*/
end = clock();
cpu_time = ((double) (end - start)) / CLOCKS_PER_SEC;
printf("cpu time for loop 1 (k : %4d) %.1f ms.\n",k,(cpu_time*1000));
}
printf("\n");
for (k = 1 ; k <= 1024 ; k <<= 1)
{
/*cpu clock ticks count start*/
start = clock();
count = 0;
for (i = 0; i < MAX_SIZE; i += k)
{
/*count++;*/
arr[i] += 3;
}
/*cpu clock ticks count stop*/
end = clock();
cpu_time = ((double) (end - start)) / CLOCKS_PER_SEC;
printf("cpu time for loop 2 (k : %4d) %.1f ms.\n",k,(cpu_time*1000));
}
printf("\n");
/* Third loop, performing the same operations as loop 2,
but only touching 16KB of memory
*/
for (k = 1 ; k <= 1024 ; k <<= 1)
{
/*cpu clock ticks count start*/
start = clock();
count = 0;
for (i = 0; i < MAX_SIZE; i += k)
{
count++;
arr[i & 0xfff] += 3;
}
/*cpu clock ticks count stop*/
end = clock();
cpu_time = ((double) (end - start)) / CLOCKS_PER_SEC;
printf("cpu time for loop 3 (k : %4d) %.1f ms.\n",k,(cpu_time*1000));
}
return 0;
}
Upvotes: 1
Views: 2120
Reputation: 2813
Since you are on Linux, I'll answer from that perspective. I will also write with an Intel (i.e., x86-64) architecture in mind.
More caveats: the trouble with consistent strides (like powers of 2) is that the processor likes to hide the cache miss penalty by prefetching. You can disable the prefetcher on many machines in the BIOS (see "Changing the Prefetcher for Intel Processors").
Page faults may also be impacting your results. You are allocating 500M ints or about 2GB of storage. Loop 1 tries to touch the memory so that the OS will allocate pages, but if you don't have this much available memory (not just total, as the OS, etc takes up some space) then your results will be skewed. Furthermore, the OS may start reclaiming some of the space so that you will always be page faulting on some of your accesses.
Related to the previous, the TLB is also going to have some impact on the results. The hardware keeps a small cache of mappings from virtual to physical address in a translation lookaside buffer (TLB). Each page of memory (4KB on Intel) needs a TLB entry. So your experiment is going to need 2GB / 4KB => ~500,000 entries. Most TLBs hold less than 1000 entries, so the measurements are also skewed by this miss. Fortunately, it is only once every 4KB or 1024 ints. It is possible that malloc is allocating "large" or "huge" pages for you, for more details - Huge Pages in Linux.
Another experiment would be to repeat the third loop, but change the mask that you are using, so that you can observe the size of each cache level (L1, L2, maybe L3, rarely L4). You may also find that different cache levels use different cacheline sizes.
Upvotes: 5