Reputation: 6566
I left out the rest of the implementation for simplicity because it is not relevant here. Consider the classical implementation of double-checked locking described in Modern C++ Design.
Singleton& Singleton::Instance()
{
    if (!pInstance_)
    {
        Guard myGuard(lock_);
        if (!pInstance_)
        {
            pInstance_ = new Singleton;
        }
    }
    return *pInstance_;
}
Here the author insists that we avoid the race condition. But I have read an article, which unfortunately I don't remember very well, in which the following flow was described.
In that article the author stated that the trick is that, on the line pInstance_ = new Singleton;, the memory can be allocated and assigned to pInstance_ before the constructor is called on that memory.
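If I reconstruct the idea, the claimed flow was roughly the following (my own sketch of the described reordering, not code from that article):
// hypothetical decomposition of: pInstance_ = new Singleton;
void* raw = operator new(sizeof(Singleton));  // 1. allocate raw memory
pInstance_ = static_cast<Singleton*>(raw);    // 2. the pointer becomes visible before...
new (pInstance_) Singleton;                   // 3. ...the constructor runs on that memory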
Relying on the standard or other reliable sources, can anyone please confirm or deny the possibility or correctness of this flow? Thanks!
Upvotes: 3
Views: 1040
Reputation: 65046
The problem is that, in the absence of guarantees otherwise, the store of the pointer into pInstance_ might be seen by some other thread before the construction of the object is complete. In that case, the other thread won't enter the mutex, will just immediately return pInstance_, and when the caller uses it, it can see uninitialized values.
This apparent reordering between the stores associated with the construction of the Singleton and the store to pInstance_ may be caused by the compiler or the hardware. I'll take a quick look at both cases below.
Absent any specific guarantees related to concurrent reads (such as those offered by C++11's std::atomic objects), the compiler only needs to preserve the semantics of the code as seen by the current thread. That means, for example, that it may compile code "out of order" relative to how it appears in the source, as long as this doesn't have visible side effects (as defined by the standard) on the current thread.
In particular, it would not be uncommon for the compiler to reorder the stores performed in the constructor of Singleton with the store to pInstance_, as long as it can see that the effect is the same1.
Let's take a look at a fleshed out version of your example:
struct Lock {};

struct Guard {
    Guard(Lock& l);
};

int value;

struct Singleton {
    int x;
    Singleton() : x{value} {}

    static Lock lock_;
    static Singleton* pInstance_;
    static Singleton& Instance();
};
Singleton& Singleton::Instance()
{
    if (!pInstance_)
    {
        Guard myGuard(lock_);
        if (!pInstance_)
        {
            pInstance_ = new Singleton;
        }
    }
    return *pInstance_;
}
Here, the constructor for Singleton is very simple: it just reads from the global value and assigns it to x, the only member of Singleton.
Using godbolt, we can check exactly how gcc and clang compile this. The gcc version, annotated, is shown below:
Singleton::Instance():
mov rax, QWORD PTR Singleton::pInstance_[rip]
test rax, rax
jz .L9 ; if pInstance != NULL, go to L9
ret
.L9:
sub rsp, 24
mov esi, OFFSET FLAT:_ZN9Singleton5lock_E
lea rdi, [rsp+15]
call Guard::Guard(Lock&) ; acquire the mutex
mov rax, QWORD PTR Singleton::pInstance_[rip]
test rax, rax
jz .L10 ; second check for null, if still null goto L10
.L1:
add rsp, 24
ret
.L10:
mov edi, 4
call operator new(unsigned long) ; allocate memory (pointer in rax)
mov edx, DWORD value[rip] ; load value global
mov QWORD pInstance_[rip], rax ; store pInstance pointer!!
mov DWORD [rax], edx ; store value into pInstance_->x
jmp .L1
The last few lines are critical, in particular the two stores:
mov QWORD pInstance_[rip], rax ; store pInstance pointer!!
mov DWORD [rax], edx ; store value into pInstance_->x
Effectively, the line pInstance_ = new Singleton; has been transformed into:
Singleton* stemp = static_cast<Singleton*>(operator new(sizeof(Singleton))); // (1) allocate uninitialized memory for a Singleton object on the heap
int vtemp = value;                                                           // (2) read the global variable value
pInstance_ = stemp;                                                          // (3) write the pointer, still uninitialized, into the global pInstance_ (oops!)
pInstance_->x = vtemp;                                                       // (4) initialize the Singleton by writing x
Oops! Any second thread arriving after (3) has occurred but before (4) has will see a non-null pInstance_, but will then read an uninitialized (garbage) value for pInstance_->x.
So even without invoking any weird hardware reordering at all, this pattern isn't safe without doing more work.
Let's say you arrange things so that this reordering of the stores doesn't occur on your compiler2, perhaps by putting a compiler barrier such as asm volatile ("" ::: "memory") between the construction and the store to pInstance_. With that small change, gcc now compiles this to have the two critical stores in the "desired" order:
mov DWORD PTR [rax], edx
mov QWORD PTR Singleton::pInstance_[rip], rax
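For concreteness, one way such a barrier might be placed in the source is sketched below (assuming a gcc-style compiler; note that an empty asm statement like this only constrains the compiler, not the hardware):
Singleton* tmp = new Singleton;  // allocate and construct first
asm volatile("" ::: "memory");   // compiler-only barrier: stores cannot be moved across it
pInstance_ = tmp;                // only now publish the pointer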
So we're good, right?
Well, on x86 we are. It happens that x86 has a relatively strong memory model, and all stores already have release semantics. I won't describe the full semantics, but in the context of the two stores above it implies that the stores appear in program order to other CPUs: any CPU that sees the second write (to pInstance_) will necessarily also see the prior write (to pInstance_->x).
We can illustrate that by using the C++11 std::atomic
feature to explicitly ask for a release store for pInstance_
(this also enables us to get rid of the compiler barrier):
static std::atomic<Singleton*> pInstance_;
...
if (!pInstance_)
{
    pInstance_.store(new Singleton, std::memory_order_release);
}
We get reasonable assembly with no hardware memory barriers or anything (there is a redundant load now, but this is both a missed optimization by gcc and a consequence of the way we wrote the function).
So we're done, right?
Nope - most other platforms don't have the strong store-to-store ordering that x86 does.
Let's take a look at ARM64 assembly around the creation of the new object:
bl operator new(unsigned long)
mov x1, x0 ; x1 holds Singleton* temp
adrp x0, .LANCHOR0
ldr w0, [x0, #:lo12:.LANCHOR0] ; load value
str w0, [x1] ; temp->x = value
mov x0, x1
str x1, [x19, #pInstance_] ; pInstance_ = temp
So we have the str
to pInstance_
as the last store, coming after the temp->x = value
store, as we want. However, the ARM64 memory model doesn't guarantee that these stores appear in program order when observed by another CPU. So even though we've tamed the compiler, the hardware can still trip us up. You'll need a barrier to solve this.
Prior to C++11, there wasn't a portable solution for this problem. For a particular ISA you could use inline assembly to emit the right barrier. Your compiler might have a builtin like __sync_synchronize
offered by gcc
, or your OS might even have something.
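For example, with the gcc builtin, the publishing step in pre-C++11 code might have looked roughly like this (a sketch only; __sync_synchronize() emits a full memory barrier, which is stronger than strictly necessary here):
Singleton* tmp = new Singleton;  // fully construct the object first
__sync_synchronize();            // full compiler + hardware barrier (gcc builtin)
pInstance_ = tmp;                // publish the pointer only after the barrier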
In C++11 and beyond, however, we finally have a formal memory model built into the language, and what we need there for double-checked locking is a release store as the final store to pInstance_. We saw this already for x86, where we checked that no extra barrier was emitted. On ARM64, using std::atomic with memory_order_release, the object-publishing code becomes:
bl operator new(unsigned long)
adrp x1, .LANCHOR0
ldr w1, [x1, #:lo12:.LANCHOR0]
str w1, [x0]
stlr x0, [x20]
The main difference is the final store is now stlr
- a release store. You can check out the PowerPC side too, where an lwsync
barrier has popped up between the two stores.
So the bottom line is that, for a safe double-checked lock, you want std::atomic with memory_order_acquire loads and memory_order_release stores.

The above only covered half of the problem: the store of pInstance_. The other half that can go wrong is the load, and the load is actually the more important one for performance, since it represents the usual fast path that gets taken after the singleton is initialized. What if pInstance_->x was loaded before pInstance_ itself was loaded and checked for null? In that case, you could still read an uninitialized value!
This might seem unlikely, since pInstance_ needs to be loaded before it is dereferenced, right? That is, there seems to be a fundamental dependency between the operations that prevents reordering, unlike the store case. Well, as it turns out, both hardware behavior and software transformations could still trip you up here, and the details are even more complex than the store case. If you use memory_order_acquire, though, you'll be fine. If you want the last ounce of performance, especially on PowerPC, you'll need to dig into the details of memory_order_consume. A tale for another day.
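Putting both halves together, a minimal C++11 sketch of the whole pattern might look like the following (using the names from the question; this is one reasonable formulation, not the only one):
#include <atomic>
#include <mutex>

class Singleton {
public:
    static Singleton& Instance()
    {
        // Fast path: the acquire load pairs with the release store below.
        Singleton* p = pInstance_.load(std::memory_order_acquire);
        if (!p)
        {
            std::lock_guard<std::mutex> guard(lock_);
            p = pInstance_.load(std::memory_order_relaxed); // recheck under the lock
            if (!p)
            {
                p = new Singleton;
                pInstance_.store(p, std::memory_order_release); // publish the fully constructed object
            }
        }
        return *p;
    }

private:
    Singleton() = default;
    static std::atomic<Singleton*> pInstance_;
    static std::mutex lock_;
};

std::atomic<Singleton*> Singleton::pInstance_{nullptr};
std::mutex Singleton::lock_;
The relaxed re-load under the lock is fine because the mutex itself provides the necessary ordering there; it is the acquire/release pair that protects the lock-free fast path.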
1 In particular, this means that the compiler has to be able to see the code for the constructor Singleton()
so that it can determine that it doesn't read from pInstance_
.
2 Of course, it's very dangerous to rely on this since you'd have to check the assembly after every compilation if anything changed!
Upvotes: 5
Reputation: 149185
The problem you describe can only occur if, for reasons I cannot imagine, the designers of the singleton used an explicit (and broken) two-step construction:
...
Guard myGuard(lock_);
if (!pInstance_)
{
    auto alloc = std::allocator<Singleton>();
    pInstance_ = alloc.allocate(1); // SHAME here: race condition
    // eventually other stuff
    alloc.construct(pInstance_);    // anything could have happened since allocation
}
...
Even if for any reason such a two-step construction were required, the pInstance_ member should never contain anything other than nullptr or a pointer to a fully constructed instance:
auto alloc = std::allocator<Singleton>();
Singleton* tmp = alloc.allocate(1); // no problem here
// eventually other stuff
alloc.construct(tmp);               // nor here
pInstance_ = tmp;                   // a fully constructed instance
But beware: the fix is only guaranteed on a single-core CPU. Things could be much worse on multi-core systems, where the C++11 atomics semantics are indeed required.
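For instance, on a multi-core system the publishing step of the fixed version above would itself need to be a release store, and the fast-path read an acquire load. A rough sketch, assuming pInstance_ is changed to a std::atomic<Singleton*>:
Singleton* tmp = alloc.allocate(1);
// eventually other stuff
alloc.construct(tmp);
pInstance_.store(tmp, std::memory_order_release); // publish the fully constructed instance
// readers pair this with pInstance_.load(std::memory_order_acquire)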
Upvotes: 4
Reputation: 67842
It used to be unspecified before C++11, because there was no standard memory model discussing multiple threads.
IIRC the pointer could have been set to the allocated address before the constructor completed, so long as that thread would never be able to tell the difference (this could probably only happen for a trivial/non-throwing constructor).
Since C++11, the sequenced-before rules disallow that reordering, specifically
8) The side effect (modification of the left argument) of the built-in assignment operator ... is sequenced after the value computation ... of both left and right arguments, ...
Since the right argument is a new-expression, that must have completed allocation & construction before the left-hand-side can be modified.
Upvotes: 1