rajesh kedia

Reputation: 41

perf reports misses larger than total accesses

I am using the code below to understand cache-miss behavior. I compile it with gcc and collect cache statistics with the perf command on Linux. What I observe is that the number of misses reported is much larger than the total number of accesses, and I could not figure out why. If anyone has seen similar behavior and can shed some light on it, that would really help.

C-code:

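#define N 30000

static char array[N][N];

int main(void) {
    register int i, j;

    /* The inner loop varies the row index, so consecutive accesses are N bytes apart. */
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            array[j][i]++;

    return 0;
}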

Command to compile:

gcc test1.c -O0 -o test1.out

Command to run the perf tool:

perf stat -e L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores,L1-dcache-store-misses ./test1.out

Upvotes: 4

Views: 1000

Answers (1)

Abdul

Reputation: 13

I recently came across a document that says: "If an instruction fetch misses in the L1 Icache, the fetch may be retried several times before the instructions have been returned to the L1 Icache. The L1 Icache miss event might be incremented every time the fetch is attempted, while the L2 cache access counter may only be incremented on the initial fetch." I am not sure whether this also holds for the data cache.

If L1 misses are retried before the line is loaded from L2, the same could apply to L1 data loads. For the 30000 by 30000 array, each iteration of the inner loop accesses data 30000 bytes away from the previous access, so the L1 data-cache load misses and the L1-dcache-load-misses counter is incremented. Then, before the data arrives from L2 or the LLC, the load is retried to check whether the data is still missing, which increments L1-dcache-load-misses again. This could explain why L1-dcache-load-misses is higher than L1-dcache-loads in your case.

As an extra note:

When you measure the CPU performance of a program with perf, the measurement contains the CPU cycles consumed by both your program and the perf program itself. So be sure to take into account only the cycles used by your program. perf stat may not allow you to do that. You can use perf record and perf report to see the cache loads and cache misses caused by your program.

As an example, you can use the following perf command to record measurements.

perf record -e L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores,L1-dcache-store-misses ./test1.out

Then you can use the following perf command to see the recording only for your test1.out program.

perf report --dsos=test1.out

Upvotes: 1
