Reputation: 9183
I wrote something using atomics rather than locks and, perplexed at it being so much slower in my case, I wrote the following mini test:
#include <pthread.h>
#include <vector>

struct test
{
    test(size_t size) : index_(0), size_(size), vec2_(size)
    {
        vec_.reserve(size_);
        pthread_mutexattr_init(&attrs_);
        pthread_mutexattr_setpshared(&attrs_, PTHREAD_PROCESS_PRIVATE);
        pthread_mutexattr_settype(&attrs_, PTHREAD_MUTEX_ADAPTIVE_NP);
        pthread_mutex_init(&lock_, &attrs_);
    }

    void lockedPush(int i);
    void atomicPush(int* i);

    size_t index_;
    size_t size_;
    std::vector<int> vec_;
    std::vector<int> vec2_;
    pthread_mutexattr_t attrs_;
    pthread_mutex_t lock_;
};

void test::lockedPush(int i)
{
    pthread_mutex_lock(&lock_);
    vec_.push_back(i);
    pthread_mutex_unlock(&lock_);
}

void test::atomicPush(int* i)
{
    int ii = (int) (i - &vec2_.front());
    size_t index = __sync_fetch_and_add(&index_, 1);
    vec2_[index & (size_ - 1)] = ii;
}

int main(int argc, char** argv)
{
    const size_t N = 1048576;
    test t(N);
    // for (int i = 0; i < N; ++i)
    //     t.lockedPush(i);
    for (int i = 0; i < N; ++i)
        t.atomicPush(&i);
}
If I uncomment the lockedPush loop (commenting out the atomicPush one) and run the test with time(1), I get output like this:
real 0m0.027s
user 0m0.022s
sys 0m0.005s
and if I instead run the loop calling atomicPush (the seemingly unnecessary pointer arithmetic is there because I want this function to look as much as possible like what my bigger code does), I get output like this:
real 0m0.046s
user 0m0.043s
sys 0m0.003s
I'm not sure why this is happening, as I would have expected the atomic to be faster than the lock in this case...
When I compile with -O3 I see the lock and atomic timings as follows:
lock:
real 0m0.024s
user 0m0.022s
sys 0m0.001s
atomic:
real 0m0.013s
user 0m0.011s
sys 0m0.002s
In my larger app, though, the lock still performs better (in single-threaded testing).
Upvotes: 2
Views: 1958
Reputation: 8872
Just to add to the first answer: when you do a __sync_fetch_and_add, you actually enforce a specific ordering of memory operations. From the documentation:
A full memory barrier is created when this function is invoked.
A memory barrier is an instruction that causes
a central processing unit (CPU) or compiler to enforce an ordering constraint on memory operations issued before and after the barrier instruction.
Chances are that even though your update is atomic, you are losing compiler optimizations by forcing an ordering of instructions.
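If no ordering around the index bump is actually needed, one option is to ask for the atomic add with relaxed ordering so that only atomicity, not a full barrier, is enforced. A minimal sketch of atomicPush rewritten this way, assuming GCC 4.7 or later (which added the __atomic builtins):

void test::atomicPush(int* i)
{
    int ii = (int) (i - &vec2_.front());
    // Still an atomic read-modify-write, but __ATOMIC_RELAXED imposes
    // no ordering constraint, so the compiler may reorder around it.
    size_t index = __atomic_fetch_add(&index_, 1, __ATOMIC_RELAXED);
    vec2_[index & (size_ - 1)] = ii;
}

Note that on x86 the generated instruction is likely the same lock xadd either way; the gain, if any, comes from the compiler being free to optimize around the operation rather than from a cheaper instruction.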
Upvotes: 1
Reputation: 477150
An uncontended mutex is extremely fast to lock and unlock. With an atomic variable, you're always paying a certain memory synchronisation penalty (especially since you're not even using relaxed ordering).
Your test case is simply too naive to be useful. You have to test a heavily contended data access scenario.
Generally, atomics are slow (they get in the way of clever internal reordering, pipelining, and caching), but they allow for lock-free code, which ensures that the entire program can make some progress. By contrast, if you get swapped out while holding a lock, everyone has to wait.
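To illustrate the kind of test that would be meaningful, here is a minimal sketch of a contended version of the locked path (the thread count of 4 and the use of globals are arbitrary choices for brevity, not from the question); an analogous worker using __sync_fetch_and_add on a shared index can be swapped in the same way. Build with g++ -O3 -pthread and time it as before:

#include <pthread.h>
#include <vector>
#include <cstddef>

static const size_t N = 1048576;   // total pushes, split across threads
static const size_t THREADS = 4;   // arbitrary degree of contention

static std::vector<int> vec;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

// Each worker does its share of pushes under the shared mutex, so all
// threads fight over the same lock.
static void* lockedWorker(void*)
{
    for (size_t i = 0; i < N / THREADS; ++i)
    {
        pthread_mutex_lock(&lock);
        vec.push_back(static_cast<int>(i));
        pthread_mutex_unlock(&lock);
    }
    return 0;
}

int main()
{
    vec.reserve(N);
    pthread_t tids[THREADS];
    for (size_t i = 0; i < THREADS; ++i)
        pthread_create(&tids[i], 0, lockedWorker, 0);
    for (size_t i = 0; i < THREADS; ++i)
        pthread_join(tids[i], 0);
}

Under real contention the relative numbers can look very different from the single-threaded run, which is where the trade-off described above actually shows up.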
Upvotes: 6