bholanath

Reputation: 1753

Disabling the hardware prefetcher makes no visible difference in access time

I have disabled the hardware prefetcher on my systems (both a Core 2 Duo and a Core i7), following the steps in this question: How do I programmatically disable hardware prefetching?

I also compile the program with gcc optimization turned off (-O0). After disabling hardware prefetching, I access consecutive cache sets (by reading array indices that map to consecutive sets), but I still get the same results as when hardware prefetching was enabled.

As I understand it, once the hardware prefetcher detects a stride pattern, it fetches two consecutive cache lines (128 bytes) from a higher cache level or main memory into the lower-level cache. When a cache line is accessed and misses, it is loaded from the higher level, and the next cache line is preloaded by the prefetcher as well. The first line therefore shows a higher access time, because it comes from a higher level of the hierarchy, while the next line shows a lower access time, because the prefetcher has already placed it in the L1 cache.

If the hardware prefetcher is disabled, then even when a stride pattern is present, the next cache line is not fetched in advance while the previous, adjacent line is being accessed. The access to the next line should therefore also miss and be served from the next cache level, so I expect a higher access time for that line too.
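To make the expectation above concrete, here is a minimal sketch of how one pair of adjacent lines could be timed. It is not the program from this post: `demo_lines`, `flush_line`, `time_load` and `time_next_line` are names made up for illustration, and it assumes an x86 CPU, GCC or Clang with `<x86intrin.h>`, 64-byte cache lines and a 4-byte `int`.

```c
#include <stdint.h>
#include <x86intrin.h>  /* _mm_clflush, _mm_mfence, _mm_lfence, __rdtsc, __rdtscp */

/* Hypothetical buffer: two adjacent 64-byte lines in one 128-byte aligned pair. */
_Alignas(128) int demo_lines[32];

/* Evicts the cache line that holds p from the cache hierarchy. */
void flush_line(const void *p)
{
    _mm_clflush(p);
    _mm_mfence();                     /* wait for the flush to complete */
}

/* Returns the TSC ticks spent on one load from p (fence overhead included). */
uint64_t time_load(const volatile int *p)
{
    unsigned aux;
    _mm_lfence();
    uint64_t start = __rdtsc();
    _mm_lfence();                     /* keep the load after the first timestamp */
    int v = *p;                       /* the load being timed */
    (void)v;
    uint64_t end = __rdtscp(&aux);    /* rdtscp waits for earlier instructions */
    _mm_lfence();
    return end - start;
}

/* Flushes both lines, touches line 0 untimed, then times a load from line 1.
   With a working adjacent-line prefetch the result may be small; with the
   prefetcher disabled it should look like a miss. */
uint64_t time_next_line(int *base)
{
    flush_line(base);
    flush_line(base + 16);            /* 16 ints = 64 bytes */
    volatile int sink = *(volatile int *)base;  /* demand miss on line 0 */
    (void)sink;
    return time_load(base + 16);
}
```

A single sample is noisy, so it is probably more useful to call `time_next_line(demo_lines)` many times and compare the typical value with the prefetcher enabled and disabled. Flushing before every sample matters, because otherwise both lines simply stay in L1 after the first access.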

In practice, however, I do not see a higher access time for consecutive cache lines even after disabling the hardware prefetcher, which suggests that the prefetcher is not actually disabled on my machine.

Am I correct?

There is also the L2 streaming (adjacent cache line) prefetcher, which is disabled by default (bit 19 in the MSR).

Is there any way to check whether the hardware prefetcher is really disabled?
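One possible way to check this on Linux is to read the MSR back through the `msr` driver. The sketch below assumes the `msr` kernel module is loaded, that the program runs as root, and the Core 2 layout in which IA32_MISC_ENABLE (MSR 0x1A0) bit 9 disables the hardware prefetcher and bit 19 disables the adjacent cache line prefetcher. On Core i7 (Nehalem) and later, Intel documents the prefetcher controls in MSR 0x1A4, bits 0 to 3, so the register and bit numbers should be adjusted for those CPUs. The function names are illustrative only.

```c
#define _XOPEN_SOURCE 700   /* for pread */
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define MSR_IA32_MISC_ENABLE      0x1A0u  /* assumed: Core 2 class CPUs */
#define BIT_HW_PREFETCH_DISABLE   9u      /* hardware (stride) prefetcher */
#define BIT_ADJACENT_LINE_DISABLE 19u     /* adjacent cache line prefetcher */

/* Returns 1 if the given bit is set in an MSR value, else 0. */
int msr_bit_set(uint64_t value, unsigned bit)
{
    assert(bit < 64);
    return (int)((value >> bit) & 1u);
}

/* Reads one MSR on one CPU via /dev/cpu/<cpu>/msr.
   Returns 0 on success, -1 if the device cannot be opened or read. */
int read_msr(int cpu, uint32_t reg, uint64_t *out)
{
    char path[64];
    snprintf(path, sizeof path, "/dev/cpu/%d/msr", cpu);
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    /* The msr driver uses the file offset as the register number. */
    ssize_t n = pread(fd, out, sizeof *out, (off_t)reg);
    close(fd);
    return n == (ssize_t)sizeof *out ? 0 : -1;
}

/* Prints the two disable bits for one CPU; returns -1 if the MSR is unreadable. */
int report_prefetch_state(int cpu)
{
    uint64_t v;
    if (read_msr(cpu, MSR_IA32_MISC_ENABLE, &v) != 0)
        return -1;
    printf("cpu %d: hardware prefetcher %s, adjacent-line prefetcher %s\n", cpu,
           msr_bit_set(v, BIT_HW_PREFETCH_DISABLE) ? "disabled" : "enabled",
           msr_bit_set(v, BIT_ADJACENT_LINE_DISABLE) ? "disabled" : "enabled");
    return 0;
}
```

The MSR is per core, so `report_prefetch_state` would need to be called for every CPU the benchmark can run on, or the benchmark pinned to one core.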

Here is my code

#include <sys/time.h>
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <unistd.h>
#include <time.h>

int main()
{
    int cacheArray[10000], temp;
    int i, block = 12;
    unsigned long t1, t2, total;
    struct timespec tim1, tim2;

    for (i = 0; i < 5; i++)
    {
        /* Time one access to the first int of cache line `block`. */
        clock_gettime(CLOCK_REALTIME, &tim1);
        temp = cacheArray[block * 16];
        clock_gettime(CLOCK_REALTIME, &tim2);

        t1 = tim1.tv_sec * 1000000000 + tim1.tv_nsec;
        t2 = tim2.tv_sec * 1000000000 + tim2.tv_nsec;
        total = t2 - t1;
        printf("Accessing %d th block took %lu nanosec \n", block, total);

        /* Time one access to the next, adjacent cache line. */
        block = block + 1;
        clock_gettime(CLOCK_REALTIME, &tim1);
        temp = cacheArray[block * 16];
        clock_gettime(CLOCK_REALTIME, &tim2);

        t1 = tim1.tv_sec * 1000000000 + tim1.tv_nsec;
        t2 = tim2.tv_sec * 1000000000 + tim2.tv_nsec;
        total = t2 - t1;
        printf("Accessing %d th block took %lu nanosec \n", block, total);

        /* Skip ahead 20 lines before measuring the next pair. */
        block = block + 20;
    }
}

Here is my sample output :

Accessing 12 th block took 137 nanosec 
Accessing 13 th block took 54 nanosec 
Accessing 33 th block took 39 nanosec 
Accessing 34 th block took 37 nanosec 
Accessing 54 th block took 687 nanosec 
Accessing 55 th block took 93 nanosec 
Accessing 75 th block took 108 nanosec 
Accessing 76 th block took 107 nanosec 
Accessing 96 th block took 109 nanosec 
Accessing 97 th block took 106 nanosec 

I expected the same or a higher access time for the second of two consecutive cache lines. Why does the next cache line appear to be in the cache already, even though the hardware prefetcher is disabled? In theory, a line should not be loaded into the cache in advance, before it is accessed.

Any suggestions or links would be highly appreciated. Thanks in advance.

Upvotes: 1

Views: 494

Answers (1)

bholanath

Reputation: 1753

Updated program that gives the expected result after disabling the hardware prefetcher

Here I access the same element at each index many times and report the average access time for that index. Measured this way, I get the expected result for every pair of indices i*16 and (i+1)*16. With the hardware prefetcher disabled, the access time for cache line (i+1) should be no lower than for cache line i, and my results show that.

Note: the cache block size is 64 bytes and I am using an int array. Since an int takes 4 bytes, indices index*16 and (index+1)*16 are 64 bytes apart and therefore fall in two different, consecutive cache lines.
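The index arithmetic in the note can be written out as a small sketch. `LINE_SIZE`, `line_of_element` and `line_of_address` are hypothetical helpers, and the 64-byte line size is the assumption stated above.

```c
#include <stddef.h>
#include <stdint.h>

#define LINE_SIZE 64u   /* assumed cache line size in bytes */

/* Cache line, relative to a line-aligned array base, that holds element idx. */
size_t line_of_element(size_t idx, size_t elem_size)
{
    return idx * elem_size / LINE_SIZE;
}

/* Absolute cache line number of an address. */
uintptr_t line_of_address(const void *p)
{
    return (uintptr_t)p / LINE_SIZE;
}
```

For a 4-byte int, `line_of_element(10 * 16, 4)` is 10 and `line_of_element(11 * 16, 4)` is 11. A stack array is not necessarily 64-byte aligned, but two elements that are exactly 64 bytes apart still land in consecutive lines, which `line_of_address(&cacheArray[(i + 1) * 16]) - line_of_address(&cacheArray[i * 16])` would confirm as 1.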

#include <sys/time.h>
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <unistd.h>
#include <time.h>
#include <stdint.h>

/* Reads the time-stamp counter. `static` is added because, in C99 and
   later modes, a plain `inline` function that is not inlined (as at -O0)
   can fail to link. */
static inline uint64_t rdtsc()
{
    unsigned long a, d;
    asm volatile ("rdtsc" : "=a" (a), "=d" (d));
    return a | ((uint64_t)d << 32);
}

int main()
{
    volatile uint64_t start, end, total;

    int cacheArray[10000], temp;
    int i, j, index;

    /* Accumulated ticks per line: first and second line of each pair. */
    unsigned long long access_time1[100];
    unsigned long long access_time2[100];

    for (i = 0; i < 100; i++)
    {
        access_time1[i] = 0;
        access_time2[i] = 0;
    }

    /* Repeat every measurement 10000 times so it can be averaged. */
    for (j = 0; j < 10000; j++)
    {
        for (i = 10; i < 100; i += 20)
        {
            index = i;

            start = rdtsc();
            temp = cacheArray[index * 16];
            end = rdtsc();

            total = end - start;
            access_time1[index] += total;
            //printf("Accessing %d th block took %llu cycles \n", index, total);

            index = index + 1;

            start = rdtsc();
            temp = cacheArray[index * 16];
            end = rdtsc();

            total = end - start;
            access_time2[index] += total;
            //printf("Accessing %d th block took %llu cycles \n\n", index, total);
        }
    }

    /* The label says "nanosec", but the averages are TSC ticks from rdtsc. */
    for (i = 10; i < 100; i += 20)
    {
        printf("Accessing %d th block took %llu nanosec \n", i, access_time1[i] / 10000);
        printf("Accessing %d th block took %llu nanosec \n\n", i + 1, access_time2[i + 1] / 10000);
    }

    return 0;
}

Accessing 10 th block took 57 nanosec 
Accessing 11 th block took 63 nanosec 

Accessing 30 th block took 62 nanosec 
Accessing 31 th block took 66 nanosec 

Accessing 50 th block took 59 nanosec 
Accessing 51 th block took 62 nanosec 

Accessing 70 th block took 62 nanosec 
Accessing 71 th block took 65 nanosec 

Accessing 90 th block took 66 nanosec 
Accessing 91 th block took 71 nanosec 

Upvotes: 1
