Alex Hoppus

Reputation: 3935

Measuring memory access time x86

I'm trying to measure cached vs. uncached memory access time, and the results are confusing me.

Here is the code:

    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>
    #include <x86intrin.h>

    #define SIZE 32*1024

    char arr[SIZE];

    int main()
    {
        char *addr;
        unsigned int dummy;
        uint64_t tsc1, tsc2;
        unsigned i;
        volatile char val;

        memset(arr, 0x0, SIZE);
        for (addr = arr; addr < arr + SIZE; addr += 64) {
            _mm_clflush((void *) addr);
        }
        asm volatile("sfence\n\t"
                :
                :
                : "memory");

        tsc1 = __rdtscp(&dummy);
        for (i = 0; i < SIZE; i++) {
            asm volatile (
                    "mov %0, %%al\n\t"  // load data
                    :
                    : "m" (arr[i])
                    );
        }
        tsc2 = __rdtscp(&dummy);
        printf("(1) tsc: %llu\n", (unsigned long long)(tsc2 - tsc1));

        tsc1 = __rdtscp(&dummy);
        for (i = 0; i < SIZE; i++) {
            asm volatile (
                    "mov %0, %%al\n\t"  // load data
                    :
                    : "m" (arr[i])
                    );
        }
        tsc2 = __rdtscp(&dummy);
        printf("(2) tsc: %llu\n", (unsigned long long)(tsc2 - tsc1));

        return 0;
    }

The output:

(1) tsc: 451248
(2) tsc: 449568

I expected the first value to be much larger, because the caches were invalidated by clflush in case (1).

Info about my cpu (Intel(R) Core(TM) i7 CPU Q 720 @ 1.60GHz) caches:

Cache ID 0:
- Level: 1
- Type: Data Cache
- Sets: 64
- System Coherency Line Size: 64 bytes
- Physical Line partitions: 1
- Ways of associativity: 8
- Total Size: 32768 bytes (32 kb)
- Is fully associative: false
- Is Self Initializing: true

Cache ID 1:
- Level: 1
- Type: Instruction Cache
- Sets: 128
- System Coherency Line Size: 64 bytes
- Physical Line partitions: 1
- Ways of associativity: 4
- Total Size: 32768 bytes (32 kb)
- Is fully associative: false
- Is Self Initializing: true

Cache ID 2:
- Level: 2
- Type: Unified Cache
- Sets: 512
- System Coherency Line Size: 64 bytes
- Physical Line partitions: 1
- Ways of associativity: 8
- Total Size: 262144 bytes (256 kb)
- Is fully associative: false
- Is Self Initializing: true

Cache ID 3:
- Level: 3
- Type: Unified Cache
- Sets: 8192
- System Coherency Line Size: 64 bytes
- Physical Line partitions: 1
- Ways of associativity: 12
- Total Size: 6291456 bytes (6144 kb)
- Is fully associative: false
- Is Self Initializing: true

Code disassembly between two rdtscp instructions

  400614:       0f 01 f9                rdtscp 
  400617:       89 ce                   mov    %ecx,%esi
  400619:       48 8b 4d d8             mov    -0x28(%rbp),%rcx
  40061d:       89 31                   mov    %esi,(%rcx)
  40061f:       48 c1 e2 20             shl    $0x20,%rdx
  400623:       48 09 d0                or     %rdx,%rax
  400626:       48 89 45 c0             mov    %rax,-0x40(%rbp)
  40062a:       c7 45 b4 00 00 00 00    movl   $0x0,-0x4c(%rbp)
  400631:       eb 0d                   jmp    400640 <main+0x8a>
  400633:       8b 45 b4                mov    -0x4c(%rbp),%eax
  400636:       8a 80 80 10 60 00       mov    0x601080(%rax),%al
  40063c:       83 45 b4 01             addl   $0x1,-0x4c(%rbp)
  400640:       81 7d b4 ff 7f 00 00    cmpl   $0x7fff,-0x4c(%rbp)
  400647:       76 ea                   jbe    400633 <main+0x7d>
  400649:       48 8d 45 b0             lea    -0x50(%rbp),%rax
  40064d:       48 89 45 e0             mov    %rax,-0x20(%rbp)
  400651:       0f 01 f9                rdtscp

It looks like I'm missing or misunderstanding something. Any suggestions?

Upvotes: 1

Views: 758

Answers (1)

Peter Cordes

Reputation: 363942

mov %0, %%al is so slow (one cache line per 64 clocks, or per 32 clocks on Sandybridge specifically (not Haswell or later)) that you might bottleneck on it whether your loads ultimately come from DRAM or from L1D.

Only every 64th load will miss in cache, because your tiny byte-load loop takes full advantage of spatial locality. If you actually wanted to test how fast the cache can refill after flushing an L1D-sized block, you should use a SIMD movdqa loop, or just byte loads with a stride of 64: you only need to touch one byte per cache line.

To avoid the false dependency on the old value of RAX, you should use movzbl %0, %eax. This will let Sandybridge and later (or AMD since K8) use their full load throughput of 2 loads per clock to keep the memory pipeline closer to full. Multiple cache misses can be in flight at once: Intel CPU cores have 10 LFBs (line fill buffers) for lines to/from L1D, or 16 Superqueue entries for lines from L2 to off-core. See also Why is Skylake so much better than Broadwell-E for single-threaded memory throughput?. (Many-core Xeon chips have worse single-thread memory bandwidth than desktops/laptops.)


But your bottleneck is far worse than that!

You compiled with optimizations disabled, so your loop uses addl $0x1,-0x4c(%rbp) for the loop counter, which gives you at least a 6-cycle loop-carried dependency chain. (Store/reload store-forwarding latency + 1 cycle for the ALU add.) http://agner.org/optimize/

(Maybe even higher because of resource conflicts for the load port. i7-720 is a Nehalem microarchitecture, so there's only one load port.)

This definitely means your loop doesn't bottleneck on cache misses, and will probably run about the same speed whether you used clflush or not.

Also note that rdtsc counts reference cycles, not core clock cycles: it ticks at a fixed frequency (about 1.6GHz on your nominally 1.60GHz CPU) regardless of whether the core is actually running slower (powersave) or faster (turbo) than that. Control for this with a warm-up loop.


You also didn't declare a clobber on eax, so the compiler isn't expecting your code to modify rax. You end up with mov 0x601080(%rax),%al. But gcc happens to reload rax (your loop counter) from memory every iteration and never uses the al that you modify, so you aren't actually skipping around in memory the way you might be if you'd compiled with optimizations enabled.

Hint: use volatile char * if you want the compiler to emit an actual load for every access, instead of optimizing them into fewer, wider loads. You don't need inline asm for this.

Upvotes: 4
