liv2hak

Reputation: 14970

analysis of cpu cache access time

I wrote the following program, with help from others on Stack Overflow, to understand cache lines and CPU caches. The results of the calculation are posted below.

    k  cache_access_1 (ms)  cache_access_2 (ms)
    1     450.0  440.0
    2     420.0  230.0
    4     400.0  110.0
    8     390.0   60.0
   16     380.0   30.0
   32     320.0   10.0
   64     180.0   10.0
  128      60.0    0.0
  256      40.0   10.0
  512      10.0    0.0
 1024      10.0    0.0

I have plotted a graph using gnuplot which is posted below.

[gnuplot graph: access time vs. k for cache_access_1 (red line) and cache_access_2]

I have the following questions.

  1. Is my timing calculation in milliseconds correct? 440 ms seems like a lot of time.

  2. From the graph of cache_access_1 (red line), can we conclude that the cache line size is 32 bits (and not 64 bits)?

  3. Is it a good idea to clear the cache between the for loops in the code? If yes, how do I do that programmatically?

  4. As you can see, I have some 0.0 values in the results above. What does this indicate? Is the granularity of measurement too coarse?

Kindly reply.

#include <stdio.h>
#include <sys/time.h>
#include <time.h>
#include <unistd.h>
#include <stdlib.h>

#define MAX_SIZE (512*1024*1024)

int main()
{
    clock_t start, end;
    double cpu_time;
    int i = 0;
    int k = 0;
    int count = 0;

    /*
     * MAX_SIZE array is too big for the stack. This is an unfortunate rough edge of the way the stack works.
     * It lives in a fixed-size buffer, set by the program executable's configuration according to the 
     * operating system, but its actual size is seldom checked against the available space.
     */

    /*int arr[MAX_SIZE];*/

    int *arr = malloc(MAX_SIZE * sizeof(int));
    if (arr == NULL)
    {
        perror("malloc");
        return 1;
    }

    /*cpu clock ticks count start*/

    for(k = 0; k < 3; k++)
    {
        start = clock();
        count = 0;

        for (i = 0; i < MAX_SIZE; i++)
        { 
            arr[i] += 3;
            /*count++;*/
        }

        /*cpu clock ticks count stop*/
        end = clock();

        cpu_time = ((double) (end - start)) / CLOCKS_PER_SEC;

        printf("cpu time for loop 1 (k : %4d) %.1f ms.\n",k,(cpu_time*1000));
    }

    printf("\n");
    for (k = 1 ; k <= 1024 ; k <<= 1)
    {
        /*cpu clock ticks count start*/
        start = clock();
        count = 0;

        for (i = 0; i < MAX_SIZE; i += k) 
        {
            /*count++;*/
            arr[i] += 3;
        }

        /*cpu clock ticks count stop*/
        end = clock();

        cpu_time = ((double) (end - start)) / CLOCKS_PER_SEC;

        printf("cpu time for loop 2 (k : %4d) %.1f ms.\n",k,(cpu_time*1000));
    }
    printf("\n");

    /* Third loop, performing the same operations as loop 2, 
       but only touching 16KB of memory
     */
    for (k = 1 ; k <= 1024 ; k <<= 1)
    {
        /*cpu clock ticks count start*/
        start = clock();
        count = 0;

        for (i = 0; i < MAX_SIZE; i += k) 
        {
            count++;
            arr[i & 0xfff] += 3;
        }

        /*cpu clock ticks count stop*/
        end = clock();

        cpu_time = ((double) (end - start)) / CLOCKS_PER_SEC;

        printf("cpu time for loop 3 (k : %4d) %.1f ms.\n",k,(cpu_time*1000));
    }

    free(arr);
    return 0;
}

Upvotes: 1

Views: 2120

Answers (1)

Brian

Reputation: 2813

Since you are on Linux, I'll answer from that perspective. I will also write with an Intel (i.e., x86-64) architecture in mind.

  1. 440 ms is probably accurate. A better way to look at the results is time per element, i.e., per access. Note that increasing k reduces the number of elements accessed. Cache access 2 shows a fairly steady result of about 0.9 ns per access, which is roughly comparable to 1 - 3 cycles per access (depending on the CPU's clock rate). So sizes 1 - 16 (maybe 32) are accurate.
  2. No (although I will first assume you mean 32 versus 64 bytes). Ask yourself: what would the cache line size look like in the data? If your stride is smaller than the cache line, you will miss once and then hit one or more times on the same line. If your stride is greater than or equal to the cache line size, every access will miss. At k=32 and above, the access time for access 1 is relatively constant at 20 ns per access. At k=1-16, the overall access time is constant, suggesting approximately the same number of cache misses. So I would conclude that the cache line size is 64 bytes.
  3. Yes, at least for the last loop, which touches only ~16KB. How? Either touch a lot of other data, like another GB-sized array, or call an instruction like x86's WBINVD, which writes dirty cache contents back to memory and then invalidates the entire cache; however, it requires kernel mode.
  4. As you noted, beyond size 32 the times hover around 10 ms, which reflects your timing granularity. You need to either increase the time required (so that a 10 ms granularity is sufficient) or switch to a different timing mechanism, which is what the comments are debating. I'm a fan of the rdtsc instruction (read timestamp counter, i.e., cycle count), but it can be even more problematic than the suggestions above. Switching your code to rdtsc basically requires replacing clock, clock_t, and CLOCKS_PER_SEC. You could still face clock drift if your thread migrates between cores, but this is a fun test, so I wouldn't worry about that issue.

More caveats: the trouble with consistent strides (like powers of 2) is that the processor likes to hide the cache miss penalty by prefetching. You can disable the prefetcher on many machines in the BIOS (see "Changing the Prefetcher for Intel Processors").

Page faults may also be impacting your results. You are allocating 512M ints, or about 2GB of storage. Loop 1 tries to touch the memory so that the OS will allocate pages, but if you don't have that much memory available (not just in total; the OS and other processes take up some space), your results will be skewed. Furthermore, the OS may start reclaiming some of the space, so that you will always be page faulting on some of your accesses.

Related to the previous, the TLB is also going to have some impact on the results. The hardware keeps a small cache of mappings from virtual to physical addresses in a translation lookaside buffer (TLB). Each page of memory (4KB on Intel) needs a TLB entry. So your experiment is going to need 2GB / 4KB => ~500,000 entries. Most TLBs hold fewer than 1000 entries, so the measurements are also skewed by these misses. Fortunately, it is only once every 4KB, or 1024 ints. It is possible that malloc is allocating "large" or "huge" pages for you; for more details, see Huge Pages in Linux.

Another experiment would be to repeat the third loop, but change the mask that you are using, so that you can observe the size of each cache level (L1, L2, maybe L3, rarely L4). You may also find that different cache levels use different cacheline sizes.

Upvotes: 5
