Omitting processor cache

Question

I have a question that I was asked a while ago during a job interview; I was wondering about the processor data cache. The question itself was about volatile variables: how can we prevent memory accesses to those variables from being optimized? From my understanding, when we read a volatile variable we need to bypass the processor cache. And this is what my question is about. What happens in such cases: is the entire cache flushed when such a variable is accessed? Or is there some register setting that says caching should be skipped for a memory region? Or is there a function for reading memory without looking in the cache? Or is it architecture dependent?

Thanks in advance for your time and answers.

Leeor · Accepted Answer

There is some confusion here - the memory your program uses (through the compiler) is in fact an abstraction, maintained jointly by the OS and the processor. As such, you don't "need" to worry about paging, swapping, physical address space and performance.

Wait, before you jump and yell at me for talking nonsense - that was not to say you shouldn't care about them. When optimizing your code you might want to know what actually happens, so you have a set of tools to assist you (SW prefetches, for example), as well as a rough idea of how the system works (cache sizes and hierarchy), allowing you to write optimized code. However, as I said, you don't have to worry about this, and if you don't - it's guaranteed to work "under the hood", to an extent. The cache, for example, is guaranteed to maintain coherency even when working with shared data (that's maintained through a set of pretty complicated HW protocols), and even in cases of virtual address aliases (multiple virtual addresses pointing to the same physical one). But here comes the "to an extent" part - in some cases you have to make sure you use it correctly. If you want to do memory-mapped IO, for example, you should define the region properly so that the processor knows it shouldn't be cached. The compiler isn't likely to do this for you implicitly; it probably won't even know.
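To make the memory-mapped IO point a bit more concrete, here is a minimal sketch of how register accesses are commonly written in C. The helper names (reg_read32, reg_write32, poll_ready) are made up for illustration and are not from any real driver API. Note that volatile here only forces the compiler to emit every load and store; whether the address is actually uncached is assumed to be arranged elsewhere (page attributes or similar set up by the OS or firmware), which this code cannot do.

```c
#include <stdint.h>

/* Hypothetical helpers, for illustration only. */

/* volatile: the compiler must emit a real load on every call. */
static inline uint32_t reg_read32(const volatile uint32_t *reg) {
    return *reg;
}

/* volatile: the compiler must emit a real store and may not drop it. */
static inline void reg_write32(volatile uint32_t *reg, uint32_t value) {
    *reg = value;
}

/*
 * Re-read a status register until one of the bits in `mask` is set,
 * giving up after `max_spins` reads.
 * Returns 1 if a bit was seen, 0 on timeout.
 */
int poll_ready(const volatile uint32_t *status, uint32_t mask,
               unsigned max_spins) {
    for (unsigned i = 0; i < max_spins; i++) {
        if (reg_read32(status) & mask)
            return 1;
    }
    return 0;
}
```

On a hosted system you can try these out by pointing them at an ordinary uint32_t variable standing in for the register; on real hardware the pointer would come from whatever mapping the platform provides.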

Now, volatile lives at a higher level: it's part of the contract between the programmer and the compiler. It means the compiler isn't allowed to perform all sorts of optimizations on this variable that would be unsafe for the program even within the memory model abstraction. These are basically cases where the value can be modified externally at any point (through an interrupt, MMIO, other threads, ...). Keep in mind that the compiler still lives above the memory abstraction: if it decides to write something to memory or read it, then aside from possible hints it relies completely on the processor to do whatever it needs to keep this chunk of memory close at hand while maintaining correctness. However, a compiler is allowed much more freedom than the HW - it could decide to move reads/writes or eliminate variables altogether, something which the CPU in most cases isn't allowed to do, so you need to prevent that from happening if it's unsafe. Some nice examples of when that happens can be found here - http://www.barrgroup.com/Embedded-Systems/How-To/C-Volatile-Keyword
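As a small sketch of the "modified externally" case, consider a flag that is set from a signal handler (the closest thing to an interrupt in a hosted C program). The names got_signal, on_sigint and wait_for_flag are hypothetical. Without volatile, the compiler would be free to read the flag once and hoist it out of the loop.

```c
#include <signal.h>

/* Hypothetical flag, written asynchronously by the handler below. */
static volatile sig_atomic_t got_signal = 0;

static void on_sigint(int signo) {
    (void)signo;
    got_signal = 1;
}

/* Returns 1 once the handler has set the flag, -1 if it could not be installed. */
int wait_for_flag(void) {
    got_signal = 0;
    if (signal(SIGINT, on_sigint) == SIG_ERR)
        return -1;

    raise(SIGINT);           /* deliver the signal to ourselves */

    while (!got_signal) {
        /* volatile forces the flag to be re-read on every iteration */
    }

    signal(SIGINT, SIG_DFL); /* restore the default disposition */
    return 1;
}
```

Nothing here touches the cache: the flag lives in ordinary cacheable memory, and volatile only keeps the compiler from assuming it never changes.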

So while a volatile hint limits the freedom of the compiler inside the memory model, it doesn't necessarily limit the underlying HW. You probably don't want it to - say you have a volatile variable that you want to expose to other threads - if the compiler made it uncacheable it would ruin the performance (and needlessly). If on top of that you also want to protect the memory model from unsafe caching (which is just a subset of the cases where volatile might come in handy), you'll have to do so explicitly.

EDIT: I felt bad for not adding any example, so to make it clearer - consider the following code:

#include <stdio.h>

int main() {
    int n = 20;
    int sum = 0;
    int x = 1;
    /*volatile */ int* px = &x;

    while (sum < n) {
        sum+= *px;
        printf("%d\n", sum);
    }
    return 0;
}

This would count from 1 to 20 in jumps of x, which is 1. Let's see how gcc -O3 writes it:

0000000000400440 :
  400440:       53                      push   %rbx
  400441:       31 db                   xor    %ebx,%ebx
  400443:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
  400448:       83 c3 01                add    $0x1,%ebx
  40044b:       31 c0                   xor    %eax,%eax
  40044d:       be 3c 06 40 00          mov    $0x40063c,%esi
  400452:       89 da                   mov    %ebx,%edx
  400454:       bf 01 00 00 00          mov    $0x1,%edi
  400459:       e8 d2 ff ff ff          callq  400430 <__printf_chk@plt>
  40045e:       83 fb 14                cmp    $0x14,%ebx
  400461:       75 e5                   jne    400448 
  400463:       31 c0                   xor    %eax,%eax
  400465:       5b                      pop    %rbx
  400466:       c3                      retq

Note the add $0x1,%ebx - since the compiler considers the variable "safe" enough (volatile is commented out here), it allows itself to treat the value as a loop invariant. In fact, if I had not printed something on each iteration, the entire loop would have been optimized away, since gcc can tell the final outcome pretty easily.

However, uncommenting the volatile keyword, we get -

0000000000400440 :
  400440:       53                      push   %rbx
  400441:       31 db                   xor    %ebx,%ebx
  400443:       48 83 ec 10             sub    $0x10,%rsp
  400447:       c7 04 24 01 00 00 00    movl   $0x1,(%rsp)
  40044e:       66 90                   xchg   %ax,%ax
  400450:       8b 04 24                mov    (%rsp),%eax
  400453:       be 4c 06 40 00          mov    $0x40064c,%esi
  400458:       bf 01 00 00 00          mov    $0x1,%edi
  40045d:       01 c3                   add    %eax,%ebx
  40045f:       31 c0                   xor    %eax,%eax
  400461:       89 da                   mov    %ebx,%edx
  400463:       e8 c8 ff ff ff          callq  400430 <__printf_chk@plt>
  400468:       83 fb 13                cmp    $0x13,%ebx
  40046b:       7e e3                   jle    400450 
  40046d:       48 83 c4 10             add    $0x10,%rsp
  400471:       31 c0                   xor    %eax,%eax
  400473:       5b                      pop    %rbx
  400474:       c3                      retq   
  400475:       90                      nop

Now the add operand is being read from the stack, as the compiler is led to suspect someone might change it. It's still cached, and as normal writeback-typed memory it would catch any attempt to modify it from another thread or DMA, and the memory system would provide the new value (most likely the cache line would be snooped and invalidated, forcing the CPU to fetch the new value from whichever core owns it now). However, as I said, if x were not a normal cacheable memory address, but was rather meant to be some MMIO or something else that might change silently beneath the memory system - then the cached value would be wrong (that's why MMIO shouldn't be cached), and the compiler would never know that, even though the variable is considered volatile.

By the way - using volatile int x and adding it directly would produce the same result. Then again - making x or px global variables would also do that, the reason being that the compiler would suspect someone might have access to it, and would therefore take the same precautions as with an explicit volatile hint. Interestingly enough, the same goes for keeping x local but copying its address into a global pointer (while still using x directly in the main loop). The compiler is quite cautious. That is not to say it's 100% foolproof: you could in theory keep x local, have the compiler do the optimizations, and then "guess" the address somewhere from the outside (from another thread, for example). This is when volatile does come in handy.
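For reference, the "volatile int x added directly" variant could look roughly like this, pulled out into a function (count_with_volatile is a name I made up for the sketch) so it is easy to compile and inspect with gcc -O3 -S. I'd expect the load of x to stay inside the loop, as in the second listing, though the exact output depends on the compiler version and flags.

```c
/* Hypothetical helper: same loop as above, with x itself declared volatile. */
int count_with_volatile(int n) {
    volatile int x = 1;  /* every use of x must be an actual load */
    int sum = 0;

    while (sum < n) {
        sum += x;        /* x is re-read on each iteration */
    }
    return sum;
}
```

The result is the same as the non-volatile version; only the generated code differs.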
