peterulb

Reputation: 2998

Why do we need the volatile keyword when the core cache synchronization is done on the hardware level?

So I'm currently listening to this talk. At minute 28:50 the following statement is made: "the fact that on the hardware it could be in main memory, in multiple level 3 caches, in four level 2 caches […] is not your problem. That's the problem for the hardware designers."

Yet, in Java we have to declare a boolean that stops a thread as volatile, since when another thread calls the stop method, it's not guaranteed that the running thread will become aware of the change.

Why is this the case, when the hardware should take care of updating every cache with the correct value?

I’m sure I’m missing something here.

Code in question:

public class App {
    public static void main(String[] args) {
        Worker worker = new Worker();
        worker.start();
        try {
            Thread.sleep(10);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        worker.signalStop();
        System.out.println(worker.isShouldStop());
        System.out.println(worker.getVal());
        System.out.println(worker.getVal());
    }

    static class Worker extends Thread {
        private /*volatile*/ boolean shouldStop = false;
        private long val = 0;

        @Override
        public void run() {
            while (!shouldStop) {
                val++;
            }
            System.out.println("Stopped");
        }

        public void signalStop() {
            this.shouldStop = true;
        }

        public long getVal() {
            return val;
        }

        public boolean isShouldStop() {
            return shouldStop;
        }
    }
}

Upvotes: 2

Views: 475

Answers (2)

pveentjer

Reputation: 11372

You are assuming the following:

  • the compiler doesn't reorder the instructions
  • the CPU performs the loads and stores in the order specified by your program

Under those assumptions your reasoning makes sense. This consistency model is called sequential consistency (SC): there is a total order over all loads/stores, and that order is consistent with the program order of each thread. In simple terms, the execution is just some interleaving of the loads/stores. The formal requirements for SC are a bit stricter, but this captures the essence.

If Java and the CPU were SC, there would be no purpose in making anything volatile.

The problem is that you would get terrible performance. A lot of compiler optimizations rely on rewriting the instructions into something more efficient, and this can lead to reordering of loads and stores. The compiler can even decide to optimize out a load or a store entirely, so that it never happens. This is all perfectly fine as long as only a single thread is involved, because that thread cannot observe the reordering of its own loads/stores.
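For example, a JIT compiler that sees a plain (non-volatile) field read in a loop is free to hoist that read out of the loop. A sketch of what your run() method may effectively be turned into (the effective result of the optimization, not literal compiler output):

    @Override
    public void run() {
        // The JIT may hoist the plain-field load out of the loop, because
        // for a single thread the value cannot change inside the loop:
        boolean stop = shouldStop;   // field is read once
        if (!stop) {
            while (true) {           // the loop never re-reads the field
                val++;
            }
        }
        System.out.println("Stopped");
    }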

Apart from the compiler, the CPU also likes to reorder loads/stores. Imagine that a CPU needs to perform a write, but the cache line for that write isn't in the right state. The CPU would have to block, and that would be very inefficient. Since the store is going to be made anyway, it is better to queue the store in a buffer so that the CPU can continue; as soon as the cache line is returned in the right state, the store is written to the cache line and committed to the cache. Store buffering is a technique used by a lot of processors (e.g. ARM/x86).

One problem with store buffering is that an earlier store to some address can be reordered with a later load from a different address. So instead of a total order over all loads and stores as in SC, you only get a total order over all stores. This model is called TSO (Total Store Order) and you can find it on the x86 and SPARC v8/v9. TSO assumes that the stores in the store buffer are written to the cache in program order; a further relaxation allows stores in the store buffer to different cache lines to be committed to the cache in any order. This is called PSO (Partial Store Order) and you can also find it on the SPARC v8/v9.
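A minimal sketch of the classic store-buffering litmus test (a hypothetical demo class; in practice you need to run it in a loop many times to catch the reordered outcome):

public class StoreBufferDemo {
    // Plain (non-volatile) fields: no ordering guarantees between them.
    static int x, y, r1, r2;

    public static void main(String[] args) throws InterruptedException {
        Thread t1 = new Thread(() -> { x = 1; r1 = y; });
        Thread t2 = new Thread(() -> { y = 1; r2 = x; });
        t1.start(); t2.start();
        t1.join(); t2.join();
        // Under SC at least one of r1/r2 must be 1. On a TSO machine
        // the stores can sit in the store buffer while the loads run,
        // so r1 == 0 && r2 == 0 is also a legal outcome.
        System.out.println("r1=" + r1 + ", r2=" + r2);
    }
}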

SC/TSO/PSO are strong memory models: every load and store is a synchronization action, so they order the surrounding loads/stores. This can be pretty expensive, because for most instructions any ordering is fine as long as the data-dependency order is preserved, since:

  • most memory is not shared between different CPUs.
  • if memory is shared, there is often some external synchronization, like an unlock/lock of a mutex or a release-store/acquire-load, that takes care of the ordering. So the synchronization can be delayed.

CPUs with weak memory models, like ARM and Itanium, make use of this. They separate plain loads/stores from synchronizing loads/stores, and for plain loads and stores any ordering is fine. Modern processors execute instructions out of order anyway; there is a lot of parallelism inside a single CPU.

Modern processors do implement cache coherence. The only modern processor that doesn't need to implement cache coherence is a GPU. Cache coherence can be implemented in two ways:

  • for small systems, the caches can sniff the bus traffic. This is where you see the MESI protocol. This technique is called sniffing (or snooping).
  • for larger systems, you can have a directory that knows the state of each cache line: which CPUs are sharing it and which CPU owns it (again with some MESI-like protocol). All requests for a cache line go through the directory.

The cache coherence protocol makes sure that a cache line is invalidated on other CPUs before a CPU can write to it. Cache coherence gives you a total order of loads/stores on a single address, but it does not provide any ordering of loads/stores between different addresses.

Coming back to volatile, this is what it does:

  • it prevents reordering of loads and stores by the compiler and the CPU.
  • it ensures that a load/store actually becomes visible; it prevents the compiler from optimizing out a load or store.
  • it makes the load/store atomic, so you don't get problems like a torn read/write. This includes compiler behavior like natural alignment of the field.

I have given you some technical information about what is happening behind the scenes. But to properly understand volatile, you need to understand the Java Memory Model. It is an abstract model that doesn't care about any of the implementation details described above. If you did not apply volatile in your example, you would have a data race, because a happens-before edge is missing between the concurrent conflicting accesses.
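Concretely, in your Worker the fix is the modifier you already have commented out:

    // The volatile write in signalStop() now happens-before any
    // subsequent volatile read in the loop, so the worker is
    // guaranteed to observe the stop signal.
    private volatile boolean shouldStop = false;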

A great book on this topic is A Primer on Memory Consistency and Cache Coherence, Second Edition. You can download it for free.

I can't recommend you any book on the Java Memory Model because it is all explained in an awful manner. It is best to get an understanding of memory models in general before diving into the JMM. Probably the best sources are Jeremy Manson's doctoral dissertation and Aleksey Shipilëv's "One Stop Page".

PS:

There are situations when you don't care about any ordering guarantees, e.g.

  • stop flag for a thread
  • progress indicators
  • blackholes for microbenchmarks.

This is where VarHandle.getOpaque/setOpaque can be useful. Opaque mode provides visibility and atomicity, but it doesn't provide any ordering guarantees with respect to other variables; that part is mostly a compiler concern. Most engineers will never need this level of control. A sketch follows.
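A minimal sketch of the stop-flag idiom using opaque access (OpaqueWorker is a hypothetical name, not from the question):

import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

class OpaqueWorker extends Thread {
    private boolean shouldStop;  // accessed only through the VarHandle

    private static final VarHandle SHOULD_STOP;
    static {
        try {
            SHOULD_STOP = MethodHandles.lookup()
                    .findVarHandle(OpaqueWorker.class, "shouldStop", boolean.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    @Override
    public void run() {
        // Opaque read: visible and atomic, but with no ordering
        // guarantees with respect to other variables.
        while (!(boolean) SHOULD_STOP.getOpaque(this)) {
            // work
        }
    }

    public void signalStop() {
        SHOULD_STOP.setOpaque(this, true);
    }
}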

Upvotes: 9

rzwitserloot

Reputation: 103244

What you're suggesting is that hardware designers just make the world all ponies and rainbows for you.

They cannot do that - what you want makes the notion of an on-core cache completely impossible. How could a CPU core possibly know that a given memory location needs to be synced up with another core before accessing it any further, short of just keeping the entire cache in sync on a permanent basis, completely invalidating the entire idea of an on-core cache?

If the talk is strongly suggesting that you, as a software engineer, can just blame hardware engineers for not making life easy for you, it's a horrible and stupid talk. I bet it was put a little more nuanced than that.

At any rate, you took the wrong lesson from it.

It's a two-way street. The hardware engineering team works together with the JVM team, effectively, to set up a consistent model that is a good equilibrium between 'With these constraints and limited guarantees to the software engineer, the hardware team can make reliable and significant performance improvements' and 'A software engineer can build multicore software with this model without tearing their hair out'.

This happy equilibrium in Java is the JMM (Java Memory Model), which primarily boils down to: any field access may be served from a thread-local cache, or not; you do not know, and you cannot test which one you got. Essentially the JVM has an evil coin and will flip it every time you read a field. Tails, you get the local copy. Heads, it syncs first. The coin is evil in that it is not fair: it will land heads throughout development, testing, and the first week in production, every time, even if you flip it a million times. And then the important potential customer demos your software and you start getting tails.

The solution is to make the JVM never flip the coin, and that means you need to establish Happens-Before/Happens-After relationships anywhere in your code where one thread writes a field and another reads it. volatile is one way to do it; a sketch of another is shown below.
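A sketch of one alternative to volatile: guard every access to the flag with the same lock, so the monitor unlock in the writer happens-before the later monitor lock in the reader (a hypothetical variant of the question's Worker):

static class Worker extends Thread {
    private final Object lock = new Object();
    private boolean shouldStop = false;

    public void signalStop() {
        synchronized (lock) {      // unlock happens-before...
            shouldStop = true;
        }
    }

    private boolean shouldStop() {
        synchronized (lock) {      // ...the next lock of the same monitor
            return shouldStop;
        }
    }

    @Override
    public void run() {
        while (!shouldStop()) {
            // work
        }
    }
}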

In other words, to give hardware engineers something to work with, you, the software engineer, effectively made the promise that you'll establish HB/HA if you care about synchronizing between threads. So that's your part of the 'deal'. Their part of the deal is that the hardware guarantees the behaviour if you keep up your end of the deal, and that the hardware is very very fast.

Upvotes: 1
