Reputation: 14215
The following pattern is commonplace in lots of software that wants to tell its user how many times it has done various things:
int num_times_done_it; // global

void doit() {
  ++num_times_done_it;
  // do something
}

void report_stats() {
  printf("called doit %i times\n", num_times_done_it);
  // and probably some other stuff too
}
Unfortunately, if multiple threads can call doit without some sort of synchronisation, the concurrent read-modify-writes to num_times_done_it may be a data race, and hence the entire program's behaviour would be undefined. Further, if report_stats can be called concurrently with doit absent any synchronisation, there's another data race between the thread modifying num_times_done_it and the thread reporting its value.

Often, the programmer just wants a mostly-right count of the number of times doit has been called, with as little overhead as possible.
(If you consider this example trivial, Hogwild! gains a significant speed advantage over a data-race-free stochastic gradient descent using essentially this trick. Also, I believe the Hotspot JVM does exactly this sort of unguarded, multithreaded access to a shared counter for method invocation counts---though it's in the clear since it generates assembly code instead of C++11.)
Apparent non-solutions:
- volatile: nothing suggests that throwing volatile into the mix makes data races OK, so replacing the declaration of num_times_done_it by volatile int num_times_done_it doesn't fix anything.
- Keeping a count per thread and summing in report_stats: that doesn't solve the data race between doit and report_stats. Also, it's messy, it assumes the updates are associative, and doesn't really fit Hogwild!'s usage.

Is it possible to implement invocation counters with well-defined semantics in a nontrivial, multithreaded C++11 program without some form of synchronisation?
EDIT: It seems that we can do this in a slightly indirect way using memory_order_relaxed:

atomic<int> num_times_done_it;

void doit() {
  num_times_done_it.store(1 + num_times_done_it.load(memory_order_relaxed),
                          memory_order_relaxed);
  // as before
}
However, gcc 4.8.2 generates this code on x86_64 (with -O3):
0: 8b 05 00 00 00 00 mov 0x0(%rip),%eax
6: 83 c0 01 add $0x1,%eax
9: 89 05 00 00 00 00 mov %eax,0x0(%rip)
and clang 3.4 generates this code on x86_64 (again with -O3):
0: 8b 05 00 00 00 00 mov 0x0(%rip),%eax
6: ff c0 inc %eax
8: 89 05 00 00 00 00 mov %eax,0x0(%rip)
My understanding of x86-TSO is that both of these code sequences are, barring interrupts and funny page protection flags, entirely equivalent to the one-instruction memory add and the one-instruction memory inc generated by the straightforward code. Does this use of memory_order_relaxed constitute a data race?
Upvotes: 6
Views: 1579
Reputation: 14215
It seems that the memory_order_relaxed trick is the right way to do this. This blog post by Dmitry Vyukov at Intel begins by answering exactly my question, and proceeds to list the memory_order_relaxed store and load as the proper alternative.

I am still unsure of whether this is really OK; in particular, N3710 makes me doubt that I ever understood memory_order_relaxed in the first place.
Upvotes: 0
Reputation: 10828
Count for each thread separately and sum up after the threads have joined. For intermediate results, you can also sum up in between, though the result might be off. This pattern is also faster. You might wrap it in a basic helper class for your threads so you have it everywhere if you use it often.
Also, depending on compiler and platform, atomics aren't that expensive (see Herb Sutter's "atomic weapons" talk: http://channel9.msdn.com/Shows/Going+Deep/Cpp-and-Beyond-2012-Herb-Sutter-atomic-Weapons-1-of-2), but in your case they'll create problems with the caches, so it's not advisable.
Upvotes: 1