Reputation: 38
I'm currently trying to improve the performance of a custom "pseudo" stack, which is used like this (full code is provided at the end of this post):
void test() {
    theStack.stackFrames[1] = StackFrame{ "someFunction", 30 };      // A
    theStack.stackTop.store(1, std::memory_order_seq_cst);           // B
    someFunction();                                                   // C
    theStack.stackTop.store(0, std::memory_order_seq_cst);           // D
    theStack.stackFrames[1] = StackFrame{ "someOtherFunction", 35 }; // E
    theStack.stackTop.store(1, std::memory_order_seq_cst);           // F
    someOtherFunction();                                              // G
    theStack.stackTop.store(0, std::memory_order_seq_cst);           // H
}
A sampler thread periodically suspends the target thread and reads stackTop and the stackFrames array.
My biggest performance problem is the sequentially-consistent stores to stackTop, so I'm trying to find out whether I can change them to release stores.
The central requirement is: when the sampler thread suspends the target thread and reads stackTop == 1, then the information in stackFrames[1] needs to be fully present and consistent. This means:

1. The write to stackFrames[1] (A) must become visible before the store that sets stackTop to 1 (B). ("Publish the stack frame before pointing the stack top at it.")
2. The store that resets stackTop to 0 (D) must become visible before stackFrames[1] is overwritten with the next frame (E). ("Reset stackTop before putting the stack frame in place.")

My understanding is that using release-acquire memory ordering for stackTop guarantees the first requirement, but not the second. More specifically:

- No write that precedes the release-store to stackTop in program order can be reordered to occur after it.
- However, no statement is made about writes that occur after the release-store to stackTop in program order.

Thus, my understanding is that E can be observed before D is observed. Is this correct?
But if that's the case, then wouldn't the compiler be able to reorder my program like this:
void test() {
    theStack.stackFrames[1] = StackFrame{ "someFunction", 30 };      // A
    theStack.stackTop.store(1, std::memory_order_release);           // B
    someFunction();                                                   // C
    // switched D and E:
    theStack.stackFrames[1] = StackFrame{ "someOtherFunction", 35 }; // E
    theStack.stackTop.store(0, std::memory_order_release);           // D
    theStack.stackTop.store(1, std::memory_order_release);           // F
    someOtherFunction();                                              // G
    theStack.stackTop.store(0, std::memory_order_release);           // H
}
... and then combine D and F, optimizing away the zero store?
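To spell out the transformation I'm worried about: after combining D and F, the function would effectively become the following. This is just my sketch of the feared result, not actual compiler output:
void test() {
    theStack.stackFrames[1] = StackFrame{ "someFunction", 30 };      // A
    theStack.stackTop.store(1, std::memory_order_release);           // B
    someFunction();                                                   // C
    // D and F merged into a single store: stackTop never drops back to 0,
    // so the sampler could observe stackTop == 1 while E is rewriting the frame.
    theStack.stackFrames[1] = StackFrame{ "someOtherFunction", 35 }; // E
    theStack.stackTop.store(1, std::memory_order_release);           // D+F
    someOtherFunction();                                              // G
    theStack.stackTop.store(0, std::memory_order_release);           // H
}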
I ask because that's not what I'm seeing when I compile the program (the full release-store version at the end of this post) with the system clang on macOS:
$ clang++ -c main.cpp -std=c++11 -O3 && objdump -d main.o
main.o: file format Mach-O 64-bit x86-64
Disassembly of section __TEXT,__text:
__Z4testv:
0: 55 pushq %rbp
1: 48 89 e5 movq %rsp, %rbp
4: 48 8d 05 5d 00 00 00 leaq 93(%rip), %rax
b: 48 89 05 10 00 00 00 movq %rax, 16(%rip)
12: c7 05 14 00 00 00 1e 00 00 00 movl $30, 20(%rip)
1c: c7 05 1c 00 00 00 01 00 00 00 movl $1, 28(%rip)
26: e8 00 00 00 00 callq 0 <__Z4testv+0x2B>
2b: c7 05 1c 00 00 00 00 00 00 00 movl $0, 28(%rip)
35: 48 8d 05 39 00 00 00 leaq 57(%rip), %rax
3c: 48 89 05 10 00 00 00 movq %rax, 16(%rip)
43: c7 05 14 00 00 00 23 00 00 00 movl $35, 20(%rip)
4d: c7 05 1c 00 00 00 01 00 00 00 movl $1, 28(%rip)
57: e8 00 00 00 00 callq 0 <__Z4testv+0x5C>
5c: c7 05 1c 00 00 00 00 00 00 00 movl $0, 28(%rip)
66: 5d popq %rbp
67: c3 retq
Specifically, the movl $0, 28(%rip) instruction at offset 2b is still present.
Coincidentally, this output is exactly what I need in my case. But I don't know if I can rely on it, because to my understanding it's not guaranteed by my chosen memory ordering.
So my main question is this: Does the acquire-release memory order give me another (fortunate) guarantee that I'm not aware of? Or is the compiler only doing what I need by accident / because it's not optimizing this particular case as well as it could?
Full code below:
// clang++ -c main.cpp -std=c++11 -O3 && objdump -d main.o
#include <atomic>
#include <cstdint>
struct StackFrame
{
    const char* functionName;
    uint32_t lineNumber;
};

struct Stack
{
    Stack()
        : stackFrames{ StackFrame{ nullptr, 0 }, StackFrame{ nullptr, 0 } }
        , stackTop{0}
    {
    }

    StackFrame stackFrames[2];
    std::atomic<uint32_t> stackTop;
};

Stack theStack;

void someFunction();
void someOtherFunction();

void test() {
    theStack.stackFrames[1] = StackFrame{ "someFunction", 30 };
    theStack.stackTop.store(1, std::memory_order_release);
    someFunction();
    theStack.stackTop.store(0, std::memory_order_release);
    theStack.stackFrames[1] = StackFrame{ "someOtherFunction", 35 };
    theStack.stackTop.store(1, std::memory_order_release);
    someOtherFunction();
    theStack.stackTop.store(0, std::memory_order_release);
}
/**
* // Sampler thread:
*
* #include <chrono>
* #include <iostream>
* #include <thread>
*
* void suspendTargetThread();
* void unsuspendTargetThread();
*
* void samplerThread() {
* for (;;) {
* // Suspend the target thread. This uses a platform-specific
* // mechanism:
* // - SuspendThread on Windows
* // - thread_suspend on macOS
* // - send a signal + grab a lock in the signal handler on Linux
* suspendTargetThread();
*
* // Now that the thread is paused, read the leaf stack frame.
* uint32_t stackTop =
* theStack.stackTop.load(std::memory_order_acquire);
* StackFrame& f = theStack.stackFrames[stackTop];
* std::cout << f.functionName << " at line "
* << f.lineNumber << std::endl;
*
* unsuspendTargetThread();
*
* std::this_thread::sleep_for(std::chrono::milliseconds(1));
* }
* }
*/
And, to satisfy curiosity, this is the assembly if I use sequentially-consistent stores:
$ clang++ -c main.cpp -std=c++11 -O3 && objdump -d main.o
main.o: file format Mach-O 64-bit x86-64
Disassembly of section __TEXT,__text:
__Z4testv:
0: 55 pushq %rbp
1: 48 89 e5 movq %rsp, %rbp
4: 41 56 pushq %r14
6: 53 pushq %rbx
7: 48 8d 05 60 00 00 00 leaq 96(%rip), %rax
e: 48 89 05 10 00 00 00 movq %rax, 16(%rip)
15: c7 05 14 00 00 00 1e 00 00 00 movl $30, 20(%rip)
1f: 41 be 01 00 00 00 movl $1, %r14d
25: b8 01 00 00 00 movl $1, %eax
2a: 87 05 20 00 00 00 xchgl %eax, 32(%rip)
30: e8 00 00 00 00 callq 0 <__Z4testv+0x35>
35: 31 db xorl %ebx, %ebx
37: 31 c0 xorl %eax, %eax
39: 87 05 20 00 00 00 xchgl %eax, 32(%rip)
3f: 48 8d 05 35 00 00 00 leaq 53(%rip), %rax
46: 48 89 05 10 00 00 00 movq %rax, 16(%rip)
4d: c7 05 14 00 00 00 23 00 00 00 movl $35, 20(%rip)
57: 44 87 35 20 00 00 00 xchgl %r14d, 32(%rip)
5e: e8 00 00 00 00 callq 0 <__Z4testv+0x63>
63: 87 1d 20 00 00 00 xchgl %ebx, 32(%rip)
69: 5b popq %rbx
6a: 41 5e popq %r14
6c: 5d popq %rbp
6d: c3 retq
Instruments identified the xchgl instructions as the most expensive part.
Upvotes: 1
Views: 136
Reputation: 3707
You could write it like this:
void test() {
    theStack.stackFrames[1] = StackFrame{ "someFunction", 30 };      // A
    theStack.stackTop.store(1, std::memory_order_release);           // B
    someFunction();                                                   // C
    theStack.stackTop.exchange(0, std::memory_order_acq_rel);        // D
    theStack.stackFrames[1] = StackFrame{ "someOtherFunction", 35 }; // E
    theStack.stackTop.store(1, std::memory_order_release);           // F
    someOtherFunction();                                              // G
    theStack.stackTop.exchange(0, std::memory_order_acq_rel);        // H
}
This should provide the second guarantee you are looking for, namely that E may not be observed before D: the acquire half of the read-modify-write prevents the later write to stackFrames[1] from being reordered ahead of it. Otherwise I think the compiler would have the right to reorder the instructions as you suggested.
Since the sampler thread "acquires" stackTop, and it suspends the target thread before reading (which should provide additional synchronization), it should always see valid data when stackTop is 1.
If your sampler did not suspend the target thread, or if the suspension does not wait until the thread is actually suspended (check this), I think a mutex or equivalent would be necessary to prevent the sampler from reading stale data after reading stackTop as 1 (for example, if the sampler was preempted by the scheduler at the wrong moment between the two reads).
If you can rely on the suspend to provide synchronization and just need to constrain reordering by the compiler, you should have a look at std::atomic_signal_fence.
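As a rough illustration of that direction (my sketch, under the assumption that the suspend/resume mechanism itself makes the target thread's writes visible to the sampler, so only compiler reordering needs to be constrained), the stores to stackTop could become relaxed, with signal fences keeping them ordered relative to the frame writes:
void test() {
    theStack.stackFrames[1] = StackFrame{ "someFunction", 30 };      // A
    // Compiler-only barrier: keep A ordered before B.
    std::atomic_signal_fence(std::memory_order_seq_cst);
    theStack.stackTop.store(1, std::memory_order_relaxed);           // B
    someFunction();                                                   // C
    theStack.stackTop.store(0, std::memory_order_relaxed);           // D
    // Compiler-only barrier: keep D ordered before E.
    std::atomic_signal_fence(std::memory_order_seq_cst);
    theStack.stackFrames[1] = StackFrame{ "someOtherFunction", 35 }; // E
    std::atomic_signal_fence(std::memory_order_seq_cst);
    theStack.stackTop.store(1, std::memory_order_relaxed);           // F
    someOtherFunction();                                              // G
    theStack.stackTop.store(0, std::memory_order_relaxed);           // H
}
Note that std::atomic_signal_fence constrains only the compiler, not the CPU, so this variant relies entirely on the assumption above that suspending the target thread gives the sampler a consistent view of its memory.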
Upvotes: 1