Reputation: 628
So I am measuring some latencies for how long it takes to execute an add instruction on my machine, to get an estimate of CPI. I first wrote a linear version that implemented a serial add (interleaved to take advantage of the pipeline). I then took the same code, wrapped the additions in a loop, and reevaluated it.
I understand the effects of loop-level parallelism, but I don't get how the looped version can be faster than the serial version, which should still exhibit the same instruction-level parallelism. I thought maybe the looped version was taking more advantage of the pipeline through register renaming and therefore had a higher IPC, but I also tried increasing the interleaving of the linear version and it didn't really improve performance. I would have thought branch mispredictions would make the looped version quite a bit slower, but that's not the case. Any thoughts?
#include <time.h>
#include <stdio.h>
#define ONE asm volatile( "add $20, %eax; add $10, %ecx");
#define FIVE ONE ONE ONE ONE ONE
#define TWOFIVE FIVE FIVE FIVE FIVE FIVE
#define HUNDO TWOFIVE TWOFIVE TWOFIVE TWOFIVE
#define THOUSAND HUNDO HUNDO HUNDO HUNDO HUNDO HUNDO HUNDO HUNDO HUNDO HUNDO
#define TENTHOUSAND THOUSAND THOUSAND THOUSAND THOUSAND THOUSAND THOUSAND THOUSAND THOUSAND THOUSAND THOUSAND
#define HUNDREDK TENTHOUSAND TENTHOUSAND TENTHOUSAND TENTHOUSAND TENTHOUSAND TENTHOUSAND TENTHOUSAND TENTHOUSAND TENTHOUSAND TENTHOUSAND
#define MILLION HUNDREDK HUNDREDK HUNDREDK HUNDREDK HUNDREDK HUNDREDK HUNDREDK HUNDREDK HUNDREDK HUNDREDK
static __inline__ unsigned long long rdtsc(void){
    unsigned lo, hi;
    /* RDTSC puts the low 32 bits of the timestamp counter in EAX and the high 32 bits in EDX */
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((unsigned long long)lo) | (((unsigned long long)hi)<<32);
}
int main(){
    double CPI = 0;
    unsigned long long start, end;
    unsigned long long clocks;
    int i;
    start = rdtsc();
    for(i = 0; i < 10000; i++){
        HUNDREDK
    }
    end = rdtsc();
    // total cycles elapsed over the whole run
    clocks = end - start;
    CPI = clocks/(double)(200000*10000); // divide by (instructions per iteration * number of iterations)
    printf("Cycles Per Instruction %lf, Clocks %llu\n", CPI, clocks);
    return 0;
}
The difference between the two is pretty significant. The linear version has an IPC of about 0.2 and the looped version has an IPC of about 4. And yes, I remembered to change the number of instructions I was dividing by when evaluating the two :)
Maybe there is some confusion about the way I am doing this, because file size is not the issue. I am simply removing the loop. The two versions execute a different number of instructions, but I also change the value I am dividing by at the end, and they end up with the same compiled size.
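(The linear version isn't shown; based on the description above — the same file with the loop removed and the divisor adjusted — it presumably looks something like this. This is a reconstruction, not the poster's exact code; it reuses the macros and rdtsc() defined earlier in the file.)
int main(){
    double CPI = 0;
    unsigned long long start, end, clocks;
    start = rdtsc();
    HUNDREDK            /* the same 200,000 adds, executed exactly once */
    end = rdtsc();
    clocks = end - start;
    CPI = clocks/(double)200000; /* 200,000 instructions, no loop factor */
    printf("Cycles Per Instruction %lf, Clocks %llu\n", CPI, clocks);
    return 0;
}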
UPDATE: Thanks for the responses. There were a couple of issues. The first was the way I was doing my measurements: the instruction-fetch time was being amortized across the loop in one version but not in the other. I ran some more code, and the interleaving of instructions coming from loop-level parallelism was greater in the looped version than in the serial version. The serial version also still has some write-after-write dependencies that are not being renamed and are causing pipeline stalls.
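(For anyone wondering what "increasing the interleaving" could look like: below is a sketch of a wider variant of ONE that spreads the adds over four registers, so fewer back-to-back additions touch the same register. The macro name, register choice, and constants are illustrative only, and the clobber list is required here because this version uses extended asm.)
#define ONE_WIDE asm volatile( \
    "add $20, %%eax; add $10, %%ecx; add $20, %%edx; add $10, %%esi" \
    ::: "eax", "ecx", "edx", "esi");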
Upvotes: 1
Views: 197
Reputation: 2532
The more likely reason is hitting the L3 cache for the MILLION case vs. never leaving the L1/L2 cache for the HUNDREDK case.
The size of ONE is somewhere between six and eight bytes (source) - sorry for being imprecise; I'm not that good with assembly, but it's good enough for a back-of-the-envelope calculation.
With that in mind, and assuming the best-case scenario of three bytes per add (six bytes total for the two adds):
HUNDREDK ~ 600 KB
MILLION ~ 6 MB
Assuming an L1 cache of 64 KB and an L2 cache of 256 KB (source), the MILLION code spills over all the way to L3 (outside the CPU core), while HUNDREDK is mostly prefetched into L1 and L2 (inside the CPU core) - source for prefetching
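(A quick sanity check of that arithmetic, using the answer's assumed six bytes per ONE; that figure is an estimate, not a measured encoding size.)
#include <stdio.h>

int main(void){
    const long bytes_per_ONE = 6;                     /* assumed: two 3-byte adds   */
    long hundredk_bytes = 100000L  * bytes_per_ONE;   /*   600,000 bytes of code    */
    long million_bytes  = 1000000L * bytes_per_ONE;   /* 6,000,000 bytes of code    */
    printf("HUNDREDK ~ %ld KB of code\n", hundredk_bytes / 1000);
    printf("MILLION  ~ %ld MB of code\n", million_bytes / 1000000);
    return 0;
}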
Upvotes: 0
Reputation: 1795
My guess would be that because you've unrolled such a large number of iterations, the code is very large. The overhead of constantly loading new pages of instructions into the cache is much higher than the overhead of incrementing a loop variable.
In terms of branch mispredictions, the loop should actually have very few. The predictor will pick the direction the branch takes most often, which is right 9999 times out of 10000. Branch prediction is actually very good.
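(To put a rough number on it: the loop's backward branch is mispredicted roughly once, at loop exit. With a typical misprediction penalty on the order of 15-20 cycles — a ballpark figure, not something measured on your machine — that is one-time noise against the roughly 2 billion add instructions the loop executes.)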
Upvotes: 4