tmyklebu

Reputation: 14215

Data races, UB, and counters in C++11

The following pattern is commonplace in lots of software that wants to tell its user how many times it has done various things:

#include <cstdio>

int num_times_done_it; // global

void doit() {
  ++num_times_done_it;
  // do something
}

void report_stats() {
  printf("called doit %i times\n", num_times_done_it);
  // and probably some other stuff too
}

Unfortunately, if multiple threads can call doit without some sort of synchronisation, the concurrent read-modify-writes to num_times_done_it constitute a data race, and hence the entire program's behaviour is undefined. Further, if report_stats can be called concurrently with doit absent any synchronisation, there is another data race between the thread modifying num_times_done_it and the thread reporting its value.

Often, the programmer just wants a mostly-right count of the number of times doit has been called with as little overhead as possible.

(If you consider this example trivial, Hogwild! gains a significant speed advantage over a data-race-free stochastic gradient descent using essentially this trick. Also, I believe the Hotspot JVM does exactly this sort of unguarded, multithreaded access to a shared counter for method invocation counts---though it's in the clear since it generates assembly code instead of C++11.)

Apparent non-solutions:

Is it possible to implement invocation counters with well-defined semantics in a nontrivial, multithreaded C++11 program without some form of synchronisation?

EDIT: It seems that we can do this in a slightly indirect way using memory_order_relaxed:

#include <atomic>
using namespace std;

atomic<int> num_times_done_it;
void doit() {
  num_times_done_it.store(1 + num_times_done_it.load(memory_order_relaxed),
                          memory_order_relaxed);
  // as before
}

However, gcc 4.8.2 generates this code on x86_64 (with -O3):

   0:   8b 05 00 00 00 00       mov    0x0(%rip),%eax
   6:   83 c0 01                add    $0x1,%eax
   9:   89 05 00 00 00 00       mov    %eax,0x0(%rip)

and clang 3.4 generates this code on x86_64 (again with -O3):

   0:   8b 05 00 00 00 00       mov    0x0(%rip),%eax
   6:   ff c0                   inc    %eax
   8:   89 05 00 00 00 00       mov    %eax,0x0(%rip)

My understanding of x86-TSO is that both of these code sequences are, barring interrupts and funny page protection flags, entirely equivalent to the one-instruction memory inc and the one-instruction memory add generated by the straightforward code. Does this use of memory_order_relaxed constitute a data race?
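
For contrast, here is a minimal sketch of the atomic read-modify-write version (doit_rmw is an illustrative name, not from the original code): it can never lose an increment the way the split load/store can when two threads read the same old value, but on x86_64 both gcc and clang compile it to a lock-prefixed instruction, which is precisely the overhead the split version avoids.

#include <atomic>

std::atomic<int> num_times_done_it;

void doit_rmw() {
  // Atomic read-modify-write: compiles to a lock add / lock xadd on x86_64.
  num_times_done_it.fetch_add(1, std::memory_order_relaxed);
}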

Upvotes: 6

Views: 1579

Answers (2)

tmyklebu

Reputation: 14215

It seems that the memory_order_relaxed trick is the right way to do this.

This blog post by Dmitry Vyukov at Intel begins by answering exactly my question, and proceeds to list the memory_order_relaxed store and load as the proper alternative.

I am still unsure of whether this is really OK; in particular, N3710 makes me doubt that I ever understood memory_order_relaxed in the first place.

Upvotes: 0

Tobias Langner

Reputation: 10828

Count for each thread separately and sum up after the threads have joined. For intermediate results you may also sum up in between, though the result might be off. This pattern is also faster. You might embed it into a basic helper class for your threads so you have it everywhere if you use it often.
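
A minimal sketch of this per-thread counting idea (the class, names, and iteration count are illustrative, not from the answer): each thread increments only its own slot, so the increments need no synchronisation and involve no race, and the reader gets an exact total once the threads have joined.

#include <cstdio>
#include <thread>
#include <vector>

struct PerThreadCounters {
  std::vector<long> slots;                       // one slot per thread
  explicit PerThreadCounters(unsigned nthreads) : slots(nthreads, 0) {}
  void bump(unsigned tid) { ++slots[tid]; }      // each thread touches only its own slot
  long sum() const {                             // exact once all writers have joined
    long total = 0;
    for (long s : slots) total += s;
    return total;
  }
};

int main() {
  const unsigned nthreads = 4;
  PerThreadCounters counters(nthreads);
  std::vector<std::thread> threads;
  for (unsigned tid = 0; tid < nthreads; ++tid)
    threads.emplace_back([&counters, tid] {
      for (int i = 0; i < 1000000; ++i) counters.bump(tid);  // stands in for doit()
    });
  for (std::thread &t : threads) t.join();
  std::printf("called doit %ld times\n", counters.sum());    // exactly 4000000
}

Summing before the join would give the approximate intermediate result mentioned above; with plain longs that read would itself be a race, so intermediate snapshots would want the slots to be relaxed atomics. In practice each slot would also usually be padded to its own cache line to avoid false sharing between neighbouring threads.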

And, depending on compiler and platform, atomics aren't that expensive (see Herb Sutter's "atomic weapons" talk http://channel9.msdn.com/Shows/Going+Deep/Cpp-and-Beyond-2012-Herb-Sutter-atomic-Weapons-1-of-2), but in your case they'll create problems with the caches (the threads all contend on the same cache line), so it's not advisable.

Upvotes: 1
