Reputation: 38
I'm currently trying to improve the performance of a custom "pseudo" stack, which is used like this (full code is provided at the end of this post):
void test() {
    theStack.stackFrames[1] = StackFrame{ "someFunction", 30 };      // A
    theStack.stackTop.store(1, std::memory_order_seq_cst);           // B
    someFunction();                                                   // C
    theStack.stackTop.store(0, std::memory_order_seq_cst);           // D
    theStack.stackFrames[1] = StackFrame{ "someOtherFunction", 35 }; // E
    theStack.stackTop.store(1, std::memory_order_seq_cst);           // F
    someOtherFunction();                                              // G
    theStack.stackTop.store(0, std::memory_order_seq_cst);           // H
}
A sampler thread periodically suspends the target thread and reads stackTop and the stackFrames array.
My biggest performance problem is the sequentially-consistent stores to stackTop, so I'm trying to find out whether I can change them to release stores.
The central requirement is: when the sampler thread suspends the target thread and reads stackTop == 1, then the information in stackFrames[1] needs to be fully present and consistent. This means:

1. The write to stackFrames[1] (A) must become visible before the store that sets stackTop to 1 (B). ("Publish the stack frame before pointing the stack top at it.")
2. The store that resets stackTop to 0 (D) must become visible before stackFrames[1] is overwritten with the next frame (E). ("Reset stackTop before putting the stack frame in place.")

My understanding is that using release-acquire memory ordering for stackTop guarantees the first requirement, but not the second. More specifically:

- No write that precedes the release-store to stackTop in program order can be reordered to occur after it.
- However, no statement is made about writes that occur after the release-store to stackTop in program order.

Thus, my understanding is that E can be observed before D is observed. Is this correct?
But if that's the case, then wouldn't the compiler be able to reorder my program like this:
void test() {
    theStack.stackFrames[1] = StackFrame{ "someFunction", 30 };      // A
    theStack.stackTop.store(1, std::memory_order_release);           // B
    someFunction();                                                   // C
    // switched D and E:
    theStack.stackFrames[1] = StackFrame{ "someOtherFunction", 35 }; // E
    theStack.stackTop.store(0, std::memory_order_release);           // D
    theStack.stackTop.store(1, std::memory_order_release);           // F
    someOtherFunction();                                              // G
    theStack.stackTop.store(0, std::memory_order_release);           // H
}
... and then combine D and F, optimizing away the zero store?
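To spell out the transformation I'm worried about: after combining D and F, the function would effectively become the following. This is just my sketch of the feared result, not actual compiler output:
void test() {
    theStack.stackFrames[1] = StackFrame{ "someFunction", 30 };      // A
    theStack.stackTop.store(1, std::memory_order_release);           // B
    someFunction();                                                   // C
    // D and F merged into a single store: stackTop never drops back to 0,
    // so the sampler could observe stackTop == 1 while E is rewriting the frame.
    theStack.stackFrames[1] = StackFrame{ "someOtherFunction", 35 }; // E
    theStack.stackTop.store(1, std::memory_order_release);           // D+F
    someOtherFunction();                                              // G
    theStack.stackTop.store(0, std::memory_order_release);           // H
}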
I ask because that's not what I'm seeing when I compile the program (the full release-store version at the end of this post) with the system clang on macOS:
$ clang++ -c main.cpp -std=c++11 -O3 && objdump -d main.o
main.o: file format Mach-O 64-bit x86-64
Disassembly of section __TEXT,__text:
__Z4testv:
0: 55 pushq %rbp
1: 48 89 e5 movq %rsp, %rbp
4: 48 8d 05 5d 00 00 00 leaq 93(%rip), %rax
b: 48 89 05 10 00 00 00 movq %rax, 16(%rip)
12: c7 05 14 00 00 00 1e 00 00 00 movl $30, 20(%rip)
1c: c7 05 1c 00 00 00 01 00 00 00 movl $1, 28(%rip)
26: e8 00 00 00 00 callq 0 <__Z4testv+0x2B>
2b: c7 05 1c 00 00 00 00 00 00 00 movl $0, 28(%rip)
35: 48 8d 05 39 00 00 00 leaq 57(%rip), %rax
3c: 48 89 05 10 00 00 00 movq %rax, 16(%rip)
43: c7 05 14 00 00 00 23 00 00 00 movl $35, 20(%rip)
4d: c7 05 1c 00 00 00 01 00 00 00 movl $1, 28(%rip)
57: e8 00 00 00 00 callq 0 <__Z4testv+0x5C>
5c: c7 05 1c 00 00 00 00 00 00 00 movl $0, 28(%rip)
66: 5d popq %rbp
67: c3 retq
Specifically, the movl $0, 28(%rip) instruction at offset 2b is still present.
Coincidentally, this output is exactly what I need in my case. But I don't know if I can rely on it, because to my understanding it's not guaranteed by my chosen memory ordering.
So my main question is this: Does the acquire-release memory order give me another (fortunate) guarantee that I'm not aware of? Or is the compiler only doing what I need by accident / because it's not optimizing this particular case as well as it could?
Full code below:
// clang++ -c main.cpp -std=c++11 -O3 && objdump -d main.o
#include <atomic>
#include <cstdint>
struct StackFrame
{
    const char* functionName;
    uint32_t lineNumber;
};

struct Stack
{
    Stack()
        : stackFrames{ StackFrame{ nullptr, 0 }, StackFrame{ nullptr, 0 } }
        , stackTop{0}
    {
    }

    StackFrame stackFrames[2];
    std::atomic<uint32_t> stackTop;
};

Stack theStack;

void someFunction();
void someOtherFunction();

void test() {
    theStack.stackFrames[1] = StackFrame{ "someFunction", 30 };
    theStack.stackTop.store(1, std::memory_order_release);
    someFunction();
    theStack.stackTop.store(0, std::memory_order_release);
    theStack.stackFrames[1] = StackFrame{ "someOtherFunction", 35 };
    theStack.stackTop.store(1, std::memory_order_release);
    someOtherFunction();
    theStack.stackTop.store(0, std::memory_order_release);
}
/**
* // Sampler thread:
*
* #include <chrono>
* #include <iostream>
* #include <thread>
*
* void suspendTargetThread();
* void unsuspendTargetThread();
*
* void samplerThread() {
* for (;;) {
* // Suspend the target thread. This uses a platform-specific
* // mechanism:
* // - SuspendThread on Windows
* // - thread_suspend on macOS
* // - send a signal + grab a lock in the signal handler on Linux
* suspendTargetThread();
*
* // Now that the thread is paused, read the leaf stack frame.
* uint32_t stackTop =
* theStack.stackTop.load(std::memory_order_acquire);
* StackFrame& f = theStack.stackFrames[stackTop];
* std::cout << f.functionName << " at line "
* << f.lineNumber << std::endl;
*
* unsuspendTargetThread();
*
* std::this_thread::sleep_for(std::chrono::milliseconds(1));
* }
* }
*/
And, to satisfy curiosity, this is the assembly if I use sequentially-consistent stores:
$ clang++ -c main.cpp -std=c++11 -O3 && objdump -d main.o
main.o: file format Mach-O 64-bit x86-64
Disassembly of section __TEXT,__text:
__Z4testv:
0: 55 pushq %rbp
1: 48 89 e5 movq %rsp, %rbp
4: 41 56 pushq %r14
6: 53 pushq %rbx
7: 48 8d 05 60 00 00 00 leaq 96(%rip), %rax
e: 48 89 05 10 00 00 00 movq %rax, 16(%rip)
15: c7 05 14 00 00 00 1e 00 00 00 movl $30, 20(%rip)
1f: 41 be 01 00 00 00 movl $1, %r14d
25: b8 01 00 00 00 movl $1, %eax
2a: 87 05 20 00 00 00 xchgl %eax, 32(%rip)
30: e8 00 00 00 00 callq 0 <__Z4testv+0x35>
35: 31 db xorl %ebx, %ebx
37: 31 c0 xorl %eax, %eax
39: 87 05 20 00 00 00 xchgl %eax, 32(%rip)
3f: 48 8d 05 35 00 00 00 leaq 53(%rip), %rax
46: 48 89 05 10 00 00 00 movq %rax, 16(%rip)
4d: c7 05 14 00 00 00 23 00 00 00 movl $35, 20(%rip)
57: 44 87 35 20 00 00 00 xchgl %r14d, 32(%rip)
5e: e8 00 00 00 00 callq 0 <__Z4testv+0x63>
63: 87 1d 20 00 00 00 xchgl %ebx, 32(%rip)
69: 5b popq %rbx
6a: 41 5e popq %r14
6c: 5d popq %rbp
6d: c3 retq
Instruments identified the xchgl instructions as the most expensive part.
Upvotes: 1
Views: 136
Reputation: 3707
You could write it like this:
void test() {
    theStack.stackFrames[1] = StackFrame{ "someFunction", 30 };      // A
    theStack.stackTop.store(1, std::memory_order_release);           // B
    someFunction();                                                   // C
    theStack.stackTop.exchange(0, std::memory_order_acq_rel);        // D
    theStack.stackFrames[1] = StackFrame{ "someOtherFunction", 35 }; // E
    theStack.stackTop.store(1, std::memory_order_release);           // F
    someOtherFunction();                                              // G
    theStack.stackTop.exchange(0, std::memory_order_acq_rel);        // H
}
This should provide the second guarantee you are looking for, namely that E may not be observed before D: the acquire half of the read-modify-write prevents the later write to stackFrames[1] from being reordered ahead of it. Otherwise I think the compiler would have the right to reorder the instructions as you suggested.
Since the sampler thread "acquires" stackTop, and it suspends the target thread before reading (which should provide additional synchronization), it should always see valid data when stackTop is 1.
If your sampler did not suspend the target thread, or if the suspension does not wait until the thread is actually suspended (check this), I think a mutex or equivalent would be necessary to prevent the sampler from reading stale data after reading stackTop as 1 (for example, if the sampler was preempted by the scheduler at the wrong moment between the two reads).
If you can rely on the suspend to provide synchronization and just need to constrain reordering by the compiler, you should have a look at std::atomic_signal_fence.
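As a rough illustration of that direction (my sketch, under the assumption that the suspend/resume mechanism itself makes the target thread's writes visible to the sampler, so only compiler reordering needs to be constrained), the stores to stackTop could become relaxed, with signal fences keeping them ordered relative to the frame writes:
void test() {
    theStack.stackFrames[1] = StackFrame{ "someFunction", 30 };      // A
    // Compiler-only barrier: keep A ordered before B.
    std::atomic_signal_fence(std::memory_order_seq_cst);
    theStack.stackTop.store(1, std::memory_order_relaxed);           // B
    someFunction();                                                   // C
    theStack.stackTop.store(0, std::memory_order_relaxed);           // D
    // Compiler-only barrier: keep D ordered before E.
    std::atomic_signal_fence(std::memory_order_seq_cst);
    theStack.stackFrames[1] = StackFrame{ "someOtherFunction", 35 }; // E
    std::atomic_signal_fence(std::memory_order_seq_cst);
    theStack.stackTop.store(1, std::memory_order_relaxed);           // F
    someOtherFunction();                                              // G
    theStack.stackTop.store(0, std::memory_order_relaxed);           // H
}
Note that std::atomic_signal_fence constrains only the compiler, not the CPU, so this variant relies entirely on the assumption above that suspending the target thread gives the sampler a consistent view of its memory.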
Upvotes: 1