Reputation: 335
I'm trying to create a template to hold an arbitrary enum type (for type safety) and store flags of that type, as shown in the following code segment:
#include <atomic>
#include <cstdint>

enum my_flags : uint8_t {
    value = 0x01,
    foo = 0x02,
    bar = 0x04
};

template <class FlagType>
class atomic_flags {
public:
    explicit atomic_flags(FlagType f) : flag_(f) {}

    FlagType fetch_and_set(FlagType f) {
        //FlagType old; // <- Undefined behavior! At least in theory.
        FlagType old = flag_.load(std::memory_order_relaxed); // Correct, but takes twice as long.
        while (!flag_.compare_exchange_weak(old, static_cast<FlagType>(old | f))) {}
        return old;
    }

private:
    std::atomic<FlagType> flag_;
};
The storing itself is trivial and not directly relevant. What I'm interested in are the two commented lines. The first one is undefined behavior (UB) by the C++ Standard's definition. The second one is what I should use for correctness, but benchmarking shows that it is twice as slow as the first variant. At the same time, the first variant always produces the expected behavior with the MSVC compiler. (Probably because the compiler then doesn't have to load old twice, since compare_exchange_weak does that anyway.)
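To illustrate that mechanic: compare_exchange_weak writes the current value back into old whenever it fails, so in principle even a value-initialized old would be corrected by the first failed attempt. A sketch of that variant (the function name is made up, and I have not benchmarked it):

    FlagType fetch_and_set_no_load(FlagType f) {
        FlagType old{}; // value-initialized: no UB and no explicit load
        // If flag_ happens to equal old, the CAS succeeds immediately;
        // otherwise the failed CAS stores the current value into old
        // and the next iteration retries with it.
        while (!flag_.compare_exchange_weak(old, static_cast<FlagType>(old | f))) {}
        return old;
    }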
Now my question: is it possible to achieve the same performance without relying on UB? (Yes, this is part of a performance-critical section.)
As a side note: if I directly substitute uint8_t as the type and use the standard fetch_or function, the performance is equivalent to the UB case. It would probably be possible to replace FlagType in the flag_ definition with a type big enough to contain it, but that appears just as error-prone to me.
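For illustration, a minimal sketch of that substitution via the enum's underlying type (the class name is hypothetical, and std::underlying_type_t requires C++14):

    #include <atomic>
    #include <type_traits>

    template <class FlagType>
    class atomic_flags_underlying { // illustrative name, not the class above
        using U = std::underlying_type_t<FlagType>;

    public:
        explicit atomic_flags_underlying(FlagType f) : flag_(static_cast<U>(f)) {}

        FlagType fetch_and_set(FlagType f) {
            // fetch_or is only provided for integral atomics, hence the
            // casts at the boundary between the enum and its underlying type.
            return static_cast<FlagType>(flag_.fetch_or(static_cast<U>(f)));
        }

    private:
        std::atomic<U> flag_;
    };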
EDIT:
This is the code I use for testing correctness as well as for benchmarking (only the REQUIRE statements are left out in the benchmark):
TEST_CASE( "Testing atomic_flags", "[atomic_flags]" ) {
    enum my_enum : uint8_t {
        clear = 0x00,
        first = 0x01,
        second = 0x02,
        third = 0x04,
        fourth = 0x08,
        fifth = 0x10,
        all = first | second | third | fourth | fifth,
    };
    atomic_flags<my_enum> flag(clear);
    REQUIRE(flag.fetch_and_set(first) == clear);
    REQUIRE(flag.fetch_and_set(second) == first);
    REQUIRE(flag.fetch_and_set(fifth) == (first | second));
    REQUIRE(flag.fetch_and_set(third) == (first | second | fifth));
    REQUIRE(flag.fetch_and_set(fourth) == (first | second | third | fifth));
    REQUIRE(flag.fetch_and_clear(all) == all); // Note: fetch_and_clear removes a flag.
    REQUIRE(flag.load() == clear);
}
My benchmark results are 40 ns per call for the UB variant and 75 ns per call for the correct one.
Upvotes: 2
Views: 275
Reputation: 335
Thanks everyone for the quick help in pointing out my possible errors.
The actual performance of the two versions is equivalent, but in the second case the compiler did not perform all possible optimizations for some reason.
By wrapping every call of the fetch_and_set function in a benchmark::DoNotOptimize() statement, both cases perform equally well (I'm using the Google microbenchmark library, and this call prevents the return value from being optimized away). Therefore the point of the original question is moot, and the initialized value is obviously the correct choice.
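For reference, a minimal sketch of such a benchmark with Google Benchmark, assuming my_enum and atomic_flags are visible at namespace scope (the benchmark name and the flag being set are illustrative):

    #include <benchmark/benchmark.h>

    static void BM_fetch_and_set(benchmark::State& state) {
        atomic_flags<my_enum> flag(clear);
        for (auto _ : state) {
            // DoNotOptimize marks the return value as used, so the
            // compiler cannot optimize the whole call away.
            benchmark::DoNotOptimize(flag.fetch_and_set(first));
        }
    }
    BENCHMARK(BM_fetch_and_set);
    BENCHMARK_MAIN();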
Upvotes: 2