atoMerz

Reputation: 7672

Why should I explicitly make an operation atomic?

Consider the following code:

#include <iostream>
using namespace std;

int main()
{
    int x = 0;

    #pragma omp parallel num_threads(4) default(none) shared(x)
    {
        for (int i = 0; i < 1000; ++i)
            x++;
    }

    cout << x << endl;
}

The expected output is 4000. However, what I usually see is something between 2500 and 3500. I already know why (because I didn't make this operation atomic). Until today I thought this was totally acceptable, but then something came to my mind:

Cache coherency protocols are supposed to keep data consistent among cores. That is, if a core wants to write to a variable, it must first gain exclusive access to it, and then proceed with the write operation.

Now I'm wondering: why would I get any result other than 4000, even when I don't specify that it's an atomic operation?

One thing that comes to mind is that maybe, when the code is compiled into machine code, it creates two copies of x.
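For reference, by "making the operation atomic" I mean marking the increment with OpenMP's atomic directive, something like this (only the parallel region is shown):

#pragma omp parallel num_threads(4) default(none) shared(x)
{
    for (int i = 0; i < 1000; ++i)
    {
        #pragma omp atomic      // the increment becomes an indivisible read-modify-write
        x++;
    }
}
// x is now reliably 4000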

EDIT:
What I think cache coherency protocols do is explained in the following figure, taken from here (page 19):
[figure: a "Get Exclusive" transaction in a bit-vector coherency protocol]
Now I know this figure is for a multi-processor (not multi-core) system using a bit-vector protocol, but I think something close to this is used in Intel processors, which use the MESI protocol. If this is true, then the reader won't get a copy of the requested value until all invalidations are acknowledged. Correct me if I'm wrong. I've tried searching for details of how the MESI protocol works, but I haven't found much.

Upvotes: 0

Views: 219

Answers (3)

Olof Forshell

Reputation: 3274

Cache coherency means that as soon as one core (or a bus-mastering device) writes to a memory location, that location is invalidated in all other caches that contain it. This forces them to reload the location (in the form of a 64-byte cache line) before they can access it (read or write) the next time.

So cache coherency is not data coherency; it's just a guarantee that an updated location will be invalidated elsewhere as soon as possible. Caches can't do more: they're always way behind the executing cores and somewhat behind each other. If one core updates a location and another does the same slightly later, both caches concerned will think their copy is valid (and they will probably each invalidate the other's cache line).

What kind of a guarantee is this if the data isn't guaranteed to be valid? It's the best that can be done under the circumstances. The choice is between completely synchronized cores (which would run exceedingly slowly) and running at full speed with caches (with specific, defined consequences and working solutions to handle them). The solutions are essentially very short slowdowns such that everything is synchronized afterwards. These intermittent, very short slowdowns should be weighed against the permanent slowdown of fully synchronized cores.

Under normal circumstances there is no contention over the same location from different cores or bus-mastering devices. But once they begin to share certain memory locations, the mechanisms provided allow the programmer to implement the necessary synchronization.
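For example, in the OP's case one such solution is to put the shared update in a critical section so that only one core at a time touches the location (a minimal sketch using the standard OpenMP critical directive; an atomic update would be cheaper, but the principle is the same):

#pragma omp parallel num_threads(4) default(none) shared(x)
{
    for (int i = 0; i < 1000; ++i)
    {
        #pragma omp critical    // only one thread at a time executes this block
        {
            x++;                // read, increment and write back without interference
        }
    }
}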

This seems like a pretty good paper on caches ... and this.

Edit: to be more precise on cache coherency: when a core writes to a location, its own cache system will first make sure that the pertinent cache lines in the caches of other cores are invalidated. So after a write, only the cache of the core that wrote to the location will contain cached data for it.

Upvotes: 1

Gray

Reputation: 116908

Why do you think that the value x is stored in a coherent cache location? Each core has its own cache memory, but there are no guarantees of coherency between those caches unless you ask for them. And there is no guarantee about the order of the cache updates -- nor their frequency. One thread could add 100 to x and then the cache could be synchronized, overwriting the other thread's increment of 20.

The first time x is referenced, it gets pulled into a processor (or core) memory cache from central memory. Most likely each thread will get a 0 the first time. But it may be only at the very end of the loop that anything is written back to central memory, and each thread might easily write back 1000 to x. There are certainly no guarantees that x will be updated with each x++ -- either written or re-read. In fact, you are pretty much guaranteed that x will not be updated each time unless it is synchronized. In such a tight loop, x will never be evicted from the cache, so it will never be re-read automatically. Even if it weren't such a tight loop, guessing when x will be evicted would be extremely hard -- even if you were always working on the same hardware.

Lastly, the right word really is "synchronization" rather than "atomic". x++ is rarely an atomic operation these days (it is actually read, increment, store), and it is certainly not synchronized between the cache memory locations or central storage.
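If all you want is a correct total, the usual OpenMP way is not to share x inside the loop at all: give each thread a private copy and let the runtime combine them at the end with a reduction clause. A sketch based on the question's code (reduction(+:x) is standard OpenMP):

int x = 0;

#pragma omp parallel num_threads(4) default(none) reduction(+:x)
{
    for (int i = 0; i < 1000; ++i)
        x++;                    // each thread increments its own private copy
}                               // the private copies are summed into x here
cout << x << endl;              // prints 4000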

Upvotes: 2

Tudor

Reputation: 62459

I agree 100% with Gray's answer. However, the non-atomicity of increment is a known issue, and it's not just applicable to multi-core machines; it can just as well occur on a single-core machine.

The fact is that x++ is (usually) actually accomplished through several assembler instructions, for example:

load r,[x]  ; load memory into register
incr r       ; increment register
stor [x],r  ; store register back to memory

so although it's a single operation in the C program, it's actually a non-atomic sequence of assembler instructions that can be interrupted at any point. So even on a single-core machine, a thread might be interrupted before completing the increment, leaving the variable in an inconsistent state.

Some compilers or architectures may indeed treat increment as atomic, but it's not a good idea to assume this.
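On C++11 and later you can also make the increment itself atomic with std::atomic, in which case the compiler emits a locked read-modify-write instruction (for example lock xadd on x86) instead of the interruptible three-instruction sequence above. A sketch of the question's loop rewritten this way (std::atomic is standard C++; the OpenMP structure is taken from the question):

#include <atomic>
#include <iostream>

int main()
{
    std::atomic<int> x(0);

    #pragma omp parallel num_threads(4) default(none) shared(x)
    {
        for (int i = 0; i < 1000; ++i)
            x.fetch_add(1, std::memory_order_relaxed);  // atomic read-modify-write; relaxed ordering is enough for a plain counter
    }

    std::cout << x << std::endl;   // prints 4000
}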

Upvotes: 4
