Reputation: 9183
I wrote something using atomics rather than locks and, perplexed at it being so much slower in my case, I wrote the following mini test:
#include <pthread.h>
#include <vector>

struct test
{
    test(size_t size) : index_(0), size_(size), vec2_(size)
    {
        vec_.reserve(size_);
        pthread_mutexattr_init(&attrs_);
        pthread_mutexattr_setpshared(&attrs_, PTHREAD_PROCESS_PRIVATE);
        pthread_mutexattr_settype(&attrs_, PTHREAD_MUTEX_ADAPTIVE_NP);
        pthread_mutex_init(&lock_, &attrs_);
    }

    void lockedPush(int i);
    void atomicPush(int* i);

    size_t index_;
    size_t size_;
    std::vector<int> vec_;
    std::vector<int> vec2_;
    pthread_mutexattr_t attrs_;
    pthread_mutex_t lock_;
};

void test::lockedPush(int i)
{
    pthread_mutex_lock(&lock_);
    vec_.push_back(i);
    pthread_mutex_unlock(&lock_);
}

void test::atomicPush(int* i)
{
    int ii = (int) (i - &vec2_.front());
    size_t index = __sync_fetch_and_add(&index_, 1);
    vec2_[index & (size_ - 1)] = ii;
}

int main(int argc, char** argv)
{
    const size_t N = 1048576;
    test t(N);
    // for (int i = 0; i < N; ++i)
    //     t.lockedPush(i);
    for (int i = 0; i < N; ++i)
        t.atomicPush(&i);
}
If I uncomment the lockedPush loop (commenting out the atomicPush one) and run the test with time(1), I get output like this:
real 0m0.027s
user 0m0.022s
sys 0m0.005s
and if I instead run the loop calling atomicPush (the seemingly unnecessary pointer arithmetic is there because I want this function to look as much as possible like what my bigger code does), I get output like this:
real 0m0.046s
user 0m0.043s
sys 0m0.003s
I'm not sure why this is happening, as I would have expected the atomic to be faster than the lock in this case...
When I compile with -O3 I see the lock and atomic timings as follows:
lock:
real 0m0.024s
user 0m0.022s
sys 0m0.001s
atomic:
real 0m0.013s
user 0m0.011s
sys 0m0.002s
In my larger app, though, the lock still performs better (in single-threaded testing).
Upvotes: 2
Views: 1958
Reputation: 8872
Just to add to the first answer: when you do a __sync_fetch_and_add, you actually enforce a specific ordering of memory operations. From the documentation:
A full memory barrier is created when this function is invoked.
A memory barrier is an instruction that causes
a central processing unit (CPU) or compiler to enforce an ordering constraint on memory operations issued before and after the barrier instruction.
Chances are that even though your update is atomic, you are losing compiler optimizations by forcing an ordering of instructions.
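If no ordering around the index bump is actually needed, one option is to ask for the atomic add with relaxed ordering so that only atomicity, not a full barrier, is enforced. A minimal sketch of atomicPush rewritten this way, assuming GCC 4.7 or later (which added the __atomic builtins):

void test::atomicPush(int* i)
{
    int ii = (int) (i - &vec2_.front());
    // Still an atomic read-modify-write, but __ATOMIC_RELAXED imposes
    // no ordering constraint, so the compiler may reorder around it.
    size_t index = __atomic_fetch_add(&index_, 1, __ATOMIC_RELAXED);
    vec2_[index & (size_ - 1)] = ii;
}

Note that on x86 the generated instruction is likely the same lock xadd either way; the gain, if any, comes from the compiler being free to optimize around the operation rather than from a cheaper instruction.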
Upvotes: 1
Reputation: 477150
An uncontended mutex is extremely fast to lock and unlock. With an atomic variable, you're always paying a certain memory synchronisation penalty (especially since you're not even using relaxed ordering).
Your test case is simply too naive to be useful. You have to test a heavily contended data access scenario.
Generally, atomics are slow (they get in the way of clever internal reordering, pipelining, and caching), but they allow for lock-free code, which ensures that the entire program can make some progress. By contrast, if you get swapped out while holding a lock, everyone has to wait.
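To illustrate the kind of test that would be meaningful, here is a minimal sketch of a contended version of the locked path (the thread count of 4 and the use of globals are arbitrary choices for brevity, not from the question); an analogous worker using __sync_fetch_and_add on a shared index can be swapped in the same way. Build with g++ -O3 -pthread and time it as before:

#include <pthread.h>
#include <vector>
#include <cstddef>

static const size_t N = 1048576;   // total pushes, split across threads
static const size_t THREADS = 4;   // arbitrary degree of contention

static std::vector<int> vec;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

// Each worker does its share of pushes under the shared mutex, so all
// threads fight over the same lock.
static void* lockedWorker(void*)
{
    for (size_t i = 0; i < N / THREADS; ++i)
    {
        pthread_mutex_lock(&lock);
        vec.push_back(static_cast<int>(i));
        pthread_mutex_unlock(&lock);
    }
    return 0;
}

int main()
{
    vec.reserve(N);
    pthread_t tids[THREADS];
    for (size_t i = 0; i < THREADS; ++i)
        pthread_create(&tids[i], 0, lockedWorker, 0);
    for (size_t i = 0; i < THREADS; ++i)
        pthread_join(tids[i], 0);
}

Under real contention the relative numbers can look very different from the single-threaded run, which is where the trade-off described above actually shows up.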
Upvotes: 6