Reputation: 95
I have a pool of threads, and each thread owns a counter (essentially thread-local storage).
A master thread frequently needs to update a total by computing the sum of all the thread-local counters.
Most of the time, each thread only increments its own counter, so no synchronization is needed.
But while the master thread is reading the counters, I of course need some kind of synchronization.
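To make the setup concrete, here is a stripped-down sketch (the names are invented for this question; the real code differs):
#include <vector>

constexpr int kNumThreads = 8;           // placeholder thread count
std::vector<int> counters(kNumThreads);  // one counter per worker

// Worker i, hot path -- this is the part that must stay cheap:
inline void Increment(int i) { ++counters[i]; }

// Master thread, called frequently -- this read races with the
// increments above, which is exactly my problem:
long long Sum() {
    long long total = 0;
    for (int c : counters) total += c;
    return total;
}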
I came up with the MSVC intrinsics (the _InterlockedXXX functions), and they showed great performance (~0.8 s on my test).
However, this limits my code to the MSVC compiler and x86/AMD64 platforms. Is there a portable C++ way to do it?
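For concreteness, here is a minimal sketch of the intrinsic approach I mean (simplified):
#include <intrin.h>

volatile long Counter = 0;   // one per worker in practice

// Worker hot path: atomic read-modify-write.
void Increment() { _InterlockedIncrement(&Counter); }

// Master: on x86/AMD64 an aligned 32-bit volatile read is atomic.
long Read() { return Counter; }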
I tried changing the counter's type from int to std::atomic<int>, using std::memory_order_relaxed for the increments, but this solution is very slow! (~4 s)
When using the internal member std::atomic<T>::_My_val, the value is accessed non-atomically, as I would like, but that isn't portable either, so the problem is the same...
Using a single std::atomic<int> shared by all threads is even slower, due to high contention (~10 s).
Do you have any ideas? Should I perhaps use a library (Boost)? Or write my own class?
Upvotes: 5
Views: 852
Reputation: 95
You are definitely right: a std::atomic<int> per thread is needed for portability, even if it is somewhat slow.
However, it can be optimized considerably on the x86 and AMD64 architectures.
Here's what I got, sInt being a signed 32- or 64-bit integer:
// Here's the magic: on x86/AMD64 an aligned 32-/64-bit load is atomic,
// so a plain volatile read is enough.
inline sInt MyInt::GetValue() {
    return *(volatile sInt *)&Value;
}

// The Interlocked intrinsic makes the store atomic (with a full barrier).
inline void MyInt::SetValue(sInt _Value) {
#ifdef _M_IX86
    _InterlockedExchange((volatile long *)&Value, _Value);
#else
    _InterlockedExchange64((volatile __int64 *)&Value, _Value);
#endif
}
This code only works with MSVC on x86/AMD64 architectures (the atomicity of the plain read in GetValue() depends on that).
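For comparison, the portable C++11 spelling of these two operations would be something like this (a sketch: on MSVC x86/AMD64 the relaxed load lowers to a plain mov when sInt matches the native word size, and the seq_cst store to xchg, but I have not checked other compilers):
#include <atomic>

using sInt = long long;  // placeholder: the signed 32/64-bit type above

std::atomic<sInt> Value{0};

inline sInt GetValue() {
    // Plain load, like the volatile read above.
    return Value.load(std::memory_order_relaxed);
}

inline void SetValue(sInt _Value) {
    // A sequentially consistent store gives the same full barrier
    // as _InterlockedExchange.
    Value.store(_Value, std::memory_order_seq_cst);
}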
Upvotes: 0
Reputation: 157354
std::atomic<int>::fetch_add(1, std::memory_order_relaxed) is just as fast as _InterlockedIncrement.
Visual Studio compiles the former to lock add $1 (or equivalent) and the latter to lock inc, but there is no difference in execution time; on my system (Core i5 @ 3.30 GHz) each takes 5630 ps/op, around 18.5 cycles.
Microbenchmark using Benchpress:
#define BENCHPRESS_CONFIG_MAIN
#include "benchpress/benchpress.hpp"
#include <atomic>
#include <intrin.h>

std::atomic<long> counter;

void f1(std::atomic<long>& counter) { counter.fetch_add(1, std::memory_order_relaxed); }
void f2(std::atomic<long>& counter) { _InterlockedIncrement((long*)&counter); }

BENCHMARK("fetch_add_1", [](benchpress::context* ctx) {
    auto& c = counter;
    for (size_t i = 0; i < ctx->num_iterations(); ++i) { f1(c); }
})

BENCHMARK("intrin", [](benchpress::context* ctx) {
    auto& c = counter;
    for (size_t i = 0; i < ctx->num_iterations(); ++i) { f2(c); }
})
Output:
fetch_add_1 200000000 5634 ps/op
intrin 200000000 5637 ps/op
Upvotes: 2
Reputation: 95
I came up with this kind of implementation, which suits me. However, I can't find a way to write semi_atomic<T>::Set():
#include <atomic>
#include <intrin.h>

template <class T>
class semi_atomic {
    T Val;
    std::atomic<T> AtomicVal;

public:
    semi_atomic() : Val(0), AtomicVal(0) {}

    // Increment is only called from the owning thread: no synchronization needed.
    inline T Increment() {
        return ++Val;
    }

    // Publish the non-atomic Val atomically and return it.
    inline T Get() {
        AtomicVal.store(Val, std::memory_order_release);
        return AtomicVal.load(std::memory_order_relaxed);
    }

    // Load _Val into Val, but in an atomic way (?)
    inline void Set(T _Val) {
        _InterlockedExchange((volatile long*)&Val, (long)_Val); // And with C++11??
    }
};
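One portable possibility for Set() that I can think of, assuming Set() (like Increment()) is only ever called from the thread that owns the counter, so the plain write cannot race with the increments:
// Sketch only: assumes the owning thread is the sole writer of Val.
inline void Set(T _Val) {
    Val = _Val;                                        // owner-only plain write
    AtomicVal.store(_Val, std::memory_order_release);  // publish to the master
}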
Thank you, and tell me if something is wrong!
Upvotes: 0