syko

Reputation: 3637

What's the most efficient implementation of CAS for atomic updates in C++?

I am writing a parallel program that uses a CAS operation to apply updates to memory shared between threads (C++, Linux, x86).

The following is my implementation of an 'Update' function, which replaces the value at the memory location pointed to by a variable, a, with the value returned by f(*a, b).

inline bool CAS (int64_t* ptr, int64_t old_val, int64_t new_val) {
    return __sync_bool_compare_and_swap(ptr, old_val, new_val); 
}

inline void Update(int64_t* a, int64_t& b) {
    volatile int64_t expected, result;
    do {
        expected = *a;
        result = f(expected, b);
    } while (!CAS(a, expected, result));
}

I see that most other implementations use almost the same code.

But I wonder whether it is the most efficient, since I see a fairly high CPI rate (1.2~1.5) in the VTune profiler.

If Update is called from the innermost loop of nested computation loops, the do ... while () loop's branch could incur significant branch misprediction. But since the semantics of CAS include a comparison, that branch may be unavoidable.

Is there a preferred implementation of the above Update function for any particular scenario? For example, compare_exchange_strong can beat compare_exchange_weak in some cases, and if the function f is addition, then fetch_add provided by std::atomic would be preferred.

// This is the updated code based on the comments (no performance gain is observed; I am micro-optimising, but it might be better in the worst case anyway)

inline bool CAS (int64_t* ptr, int64_t& old_val, int64_t new_val) {
    // Note: casting int64_t* to std::atomic<int64_t>* is not guaranteed
    // to be valid by the standard, though it works on common platforms.
    return std::atomic_compare_exchange_weak((std::atomic<int64_t>*) ptr, &old_val, new_val);
}

inline void Update(int64_t* a, int64_t& b) {
    int64_t expected, result;
    do {
        expected = *a;
        result = f(expected, b);
    } while (!CAS(a, expected, result));
}

Upvotes: 2

Views: 752

Answers (1)

Davislor

Reputation: 15144

The standard library has a portable implementation of this in <atomic>, the atomic_compare_exchange_weak() family of functions. You might get better performance out of it. Reader threads can do an atomic read with relaxed memory order if they only need some snapshot, or acquire if they need the latest value. A relaxed load might compile down to a plain memory read.

Most of your performance improvements are likely to come from better lock-free data structures and algorithms, though. A singly-linked list updated with CAS is often a very fast lock-free structure.

There are some special cases. I’m sure you know that, if only one thread is a writer, the others can simply read the updates with acquire/release semantics or even relaxed memory order. (Or, as a gcc/clang extension to match the builtin you use, through a volatile*.)

If you’re often seeing other threads finish and try to update at the same time, there might be a way to change the algorithm to space the workers out. In some algorithms, there might be a reason for threads that see an update to back off and yield to others.

Also be alert for the A-B-A bug; you don't appear to check for it. If you do need to, you might be able to leverage the cmpxchg16b instruction to CAS a 16-byte structure at once and get a better atomic update than you could with CAS on a single pointer.

Upvotes: 3
