Chartas

Reputation: 335

Optimizing atomic compare_exchange_weak

I'm trying to create a template to hold an arbitrary enum type (for type safety) and store them as shown in the following code segment:

enum my_flags : uint8_t {
    value = 0x01,
    foo = 0x02,
    bar = 0x04
};

template <class FlagType>
class atomic_flags {
public:
    FlagType fetch_and_set(FlagType f) {
        //FlagType old; // <- Undefined behavior! At least in theory.
        FlagType old = flag_.load(std::memory_order_relaxed); // Correct, but takes two times longer.
        while(!flag_.compare_exchange_weak(old, static_cast<FlagType>(old | f))) {}
        return old;
    }

private:
    std::atomic<FlagType> flag_;
};

The storing itself is trivial and not directly relevant. What I'm interested in are the two commented lines. The first one is undefined behavior (UB) by the C++ Standard, because old is read while uninitialized. The second one is what I should use for correctness, but benchmarking shows it is roughly twice as slow as the first variant. At the same time, the first variant always produces the expected behavior with the MSVC compiler (probably because the compiler doesn't have to load old separately, since compare_exchange_weak reloads it on failure anyway).

Now my question: Is it possible to achieve the same performance without relying on UB? (Yes, this is part of a performance-critical section.)

As a side note: if I substitute uint8_t directly as the type and use the standard fetch_or, performance is equivalent to the UB case. It would probably be possible to replace FlagType in the flag_ definition with an integer type big enough to contain it, but that appears just as error-prone to me.
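(Not part of the original question, but for illustration: the "type big enough" substitution can be made less error-prone with std::underlying_type_t, which picks the enum's underlying integer type automatically and lets fetch_or/fetch_and replace the CAS loop. A sketch, with the fetch_and_clear mentioned in the test added as an assumption about its intended semantics:)

```cpp
#include <atomic>
#include <cstdint>
#include <type_traits>

template <class FlagType>
class atomic_flags {
    using U = std::underlying_type_t<FlagType>; // exact size of the enum, no guessing
public:
    explicit atomic_flags(FlagType init) : flag_(static_cast<U>(init)) {}

    // Single atomic RMW instead of a compare_exchange loop.
    FlagType fetch_and_set(FlagType f) {
        return static_cast<FlagType>(flag_.fetch_or(static_cast<U>(f)));
    }

    // Removes the given flags; returns the previous value.
    FlagType fetch_and_clear(FlagType f) {
        return static_cast<FlagType>(flag_.fetch_and(static_cast<U>(~static_cast<U>(f))));
    }

    FlagType load() const {
        return static_cast<FlagType>(flag_.load());
    }

private:
    std::atomic<U> flag_;
};
```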

EDIT: This is the code I use for testing correctness as well as for benchmarking (only the REQUIRE statements are left out in the benchmark):

TEST_CASE( "Testing atomic_flags", "[atomic_flags]" ) {
    enum my_enum : uint8_t {
        clear  = 0x00,
        first  = 0x01,
        second = 0x02,
        third  = 0x04,
        fourth = 0x08,
        fifth  = 0x10,

        all = first | second | third | fourth | fifth,
    };

    atomic_flags<my_enum> flag(clear);

    REQUIRE(flag.fetch_and_set(first) == clear);
    REQUIRE(flag.fetch_and_set(second) == first);
    REQUIRE(flag.fetch_and_set(fifth) == (first | second));
    REQUIRE(flag.fetch_and_set(third) == (first | second | fifth));
    REQUIRE(flag.fetch_and_set(fourth) == (first | second | third | fifth));
    REQUIRE(flag.fetch_and_clear(all) == all); // Note: fetch_and_clear removes a flag.
    REQUIRE(flag.load() == clear);
}

My benchmark results are about 40 ns per call for the UB variant and 75 ns for the correct one.

Upvotes: 2

Views: 275

Answers (1)

Chartas

Reputation: 335

Thanks everyone for the quick help in pointing out my possible errors.

The actual performance of the two versions is equivalent, but in the second case the compiler did not apply all possible optimizations for some reason.

By wrapping every call to the fetch_and_set function in a benchmark::DoNotOptimize() statement, both cases perform equally well (I'm using the Google Benchmark library, and this call prevents the return value from being optimized away). Therefore the point of the original question is moot, and the initialized value is obviously the correct choice.
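(For readers unfamiliar with it: on GCC/Clang, benchmark::DoNotOptimize boils down to an empty inline-asm statement that takes the value as an input, forcing the compiler to materialize it without emitting any instructions. A minimal sketch of the idea; do_not_optimize is a hypothetical stand-in name, and the real library also handles MSVC and output constraints:)

```cpp
#include <atomic>
#include <cstdint>

// Compiler barrier: the empty asm takes the value as input, so the
// compiler must compute it, but no code is generated for the asm itself.
template <class T>
inline void do_not_optimize(T const& value) {
    asm volatile("" : : "r,m"(value) : "memory");
}

// A benchmark-style loop that keeps every fetch_or result alive.
inline uint8_t spin_fetch_or(std::atomic<uint8_t>& flag, int iters) {
    uint8_t last = 0;
    for (int i = 0; i < iters; ++i) {
        last = flag.fetch_or(0x01);
        do_not_optimize(last); // without this, the result may be elided
    }
    return last;
}
```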

Upvotes: 2
