Reputation: 284
I'm trying to make an accurate measurement of memory access to different cache levels, and came up with this code for probing:
__asm__ __volatile__(
"xor %%eax, %%eax \n"
"xor %%edi, %%edi \n"
"xor %%edx, %%edx \n"
/* time measurement */
"lfence \n"
"rdtsc \n"
"shl $32, %%rdx \n"
"or %%rdx, %%rax \n"
"movq %%rax, %%rdi \n"
/* memory access */
"movq (%%rsi), %%rbx\n"
/* time measurement */
"rdtscp \n"
"shl $32, %%rdx \n"
"or %%rdx, %%rax \n"
"movq %%rax, %%rsi \n"
"cpuid \n"
: /* output operands */
"=S"(t2), "=D"(t1)
: /* input operands */
"S" (mem)
: /* clobber description */
"ebx", "ecx", "edx", "cc", "memory"
);
However, the L1 and L2 cache accesses only differ by 8 cycles and the results fluctuate too much, so I decided to check how much impact the surrounding code (apart from the actual memory access) has on the timing:
__asm__ __volatile__(
"xor %%eax, %%eax \n"
"xor %%edi, %%edi \n"
"xor %%edx, %%edx \n"
/* time measurement */
"lfence \n"
"rdtsc \n"
"shl $32, %%rdx \n"
"or %%rdx, %%rax \n"
"movq %%rax, %%rdi \n"
/* memory access */
//"movq (%%rsi), %%rbx\n"
/* time measurement */
"rdtscp \n"
"shl $32, %%rdx \n"
"or %%rdx, %%rax \n"
"movq %%rax, %%rsi \n"
"cpuid \n"
: /* output operands */
"=S"(t2), "=D"(t1)
: /* input operands */
"S" (mem)
: /* clobber description */
"ebx", "ecx", "edx", "cc", "memory"
);
The results looked like this:
./cache_testing
From Memory: 42
From L3: 46
From L2: 40
From L1: 38
./cache_testing
From Memory: 40
From L3: 38
From L2: 36
From L1: 40
I'm aware that I'm not deliberately hitting the different cache levels at the moment, but I wonder why the timing fluctuates so much when the memory access is missing. The process runs as SCHED_FIFO with the highest priority, pinned to one CPU, and shouldn't be preempted while running (the setup is sketched below). Can anybody tell me if I can improve my code and thereby the results in any way?
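For completeness, the pinning and priority setup looks roughly like this (a minimal sketch, assuming Linux with sched_setaffinity and sched_setscheduler; error handling trimmed):
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
/* Sketch of the setup described above: pin to one CPU and switch to
 * SCHED_FIFO at the highest priority (needs root or CAP_SYS_NICE). */
static void pin_and_prioritize(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0)
        perror("sched_setaffinity");

    struct sched_param sp = { .sched_priority = sched_get_priority_max(SCHED_FIFO) };
    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0)
        perror("sched_setscheduler");
}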
Upvotes: 1
Views: 816
Reputation: 364128
To fix your measuring code, you're right that you need to measure an empty timed region as a baseline so you can subtract the measurement overhead.
Also keep in mind that the TSC counts reference cycles, not core clock cycles, so for this to work you need to make sure your CPU is always running at the same speed. (e.g. disable turbo and use a warm-up loop to get the CPU up to top speed, then TSC counts should match core cycles if you aren't overclocking.)
That probably explains the fluctuation.
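Putting those two points together, the measurement would look roughly like this (a sketch; probe() and probe_empty() are hypothetical wrappers around your two asm blocks, each returning t2 - t1 for one run):
/* Hypothetical wrappers around the two asm blocks above. */
extern unsigned long long probe(const void *addr);  /* with the load */
extern unsigned long long probe_empty(void);        /* load commented out */

unsigned long long measure(const void *addr)
{
    /* Warm-up: spin until the core is out of any low-frequency state,
     * so TSC reference cycles match core cycles (turbo disabled). */
    for (volatile int spin = 0; spin < 100000000; spin++)
        ;

    unsigned long long overhead = probe_empty();  /* baseline: timing code only */
    unsigned long long total    = probe(addr);    /* timing code + the load */
    return total - overhead;                      /* cycles attributable to the load */
}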
I usually measure stuff with perf counters, not RDTSC.
But I think you should be using a serializing instruction (like CPUID) before the first RDTSC. Using a CPUID after the second RDTSC probably isn't useful. rdtscp for the second measurement is useful, since it means the timestamp comes from after the load has executed. (The manual says "executed"; IDK if that means "retired" or just literally executed by a load port.)
So IIRC, your best bet is:
# maybe set eax to something before CPUID
cpuid
rdtsc
shl $32, %%rdx
lea (%%rax, %%rdx), %%rsi
... code under test
# CPUID here, too, if you can only use rdtsc instead of rdtscp
rdtscp
shl $32, %%rdx
or %%rdx, %%rax
sub %%rsi, %%rax
# time difference in RAX
If the code under test competes for the same ALU ports as shift/LEA, you could just mov the low 32 bits of the first RDTSC result to another register instead of dealing with the high 32 bits at all. If you assume that the difference in timestamps is much less than 2^32, you don't need the high 32 bits of either count.
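In that case the sequence would be something like this (a sketch; assumes the interval fits in 32 bits and that the code under test doesn't need ESI):
cpuid
rdtsc
mov %%eax, %%esi        # keep only the low 32 bits of the start timestamp
... code under test
rdtscp                  # also writes ECX
sub %%esi, %%eax        # 32-bit time difference in EAX; RDX can be ignored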
I've read that measuring tiny sequences like this on modern CPUs can be done better with performance counters than with the TSC. Agner Fog's test programs include code for using perf counters from inside a program to measure something. This lets you measure core cycles regardless of turbo or non-turbo, because the core-cycles performance counter counts one per physical clock cycle.
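On Linux you can also do that without Agner's framework via the perf_event_open syscall; here's a rough sketch that counts PERF_COUNT_HW_CPU_CYCLES around a region (error handling omitted):
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CPU_CYCLES;  /* core cycles, not TSC reference cycles */
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;

    /* Count for the calling thread (pid = 0) on any CPU (cpu = -1). */
    int fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    /* ... code under test ... */

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    long long cycles = 0;
    read(fd, &cycles, sizeof(cycles));       /* one u64: the raw count */
    printf("core cycles: %lld\n", cycles);
    close(fd);
    return 0;
}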
Upvotes: 2