Context
This question explores the semantics of the MFENCE full memory barrier instruction in the x86 instruction set, particularly in comparison with LOCK-prefixed read-modify-write instructions.
Historically, the JVM used the MFENCE instruction to implement memory ordering guarantees, such as those required for the volatile keyword. However, a significant shift occurred in the late 2000s. In a blog post from May 2009, David Dice described a transition to LOCK-prefixed instructions, specifically a LOCK:ADD of zero to the top of the stack (TOS), citing superior performance.
Dice noted that on the AMD processors of the time, LOCK:ADD was nearly twice as fast as MFENCE, and on Intel processors it also interacted more favorably with the instruction pipeline. These observations motivated JVM developers to switch to LOCK-prefixed instructions for implementing memory fences, particularly for volatile accesses.
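For concreteness, here is a minimal sketch in C with GCC-style inline assembly (x86-64, AT&T syntax) contrasting the two barrier idioms. The function names are mine, and the exact stack operand HotSpot uses may differ slightly from what is shown:

    /* Full memory barrier via MFENCE (the older HotSpot approach). */
    static inline void fence_mfence(void) {
        __asm__ __volatile__("mfence" ::: "memory");
    }

    /* Full memory barrier via a LOCK-prefixed ADD of zero to the top of
     * the stack, the idiom Dice describes: the locked read-modify-write
     * drains the store buffer and orders earlier stores before later
     * loads, as MFENCE does, while leaving the stack word unchanged. */
    static inline void fence_lock_add(void) {
        __asm__ __volatile__("lock; addl $0, 0(%%rsp)" ::: "memory", "cc");
    }

Both functions act as full (StoreLoad) barriers on x86-64; the difference the question is about is purely one of cost.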
This shift is intriguing because a LOCK-prefixed read-modify-write instruction theoretically does more work than the purpose-built MFENCE, since it must also enforce atomicity. One might therefore expect it to be less efficient, not more. Dice speculated that MFENCE might carry additional, less-visible semantics, possibly related to pipeline state or inter-core synchronization, that account for its relatively lower performance. However, the exact nature of these potential "hidden semantics" remains unclear.
Moreover, LOCK-prefixed instructions offer added flexibility: a single instruction can serve as both a memory fence and an atomic operation, and modern CPUs optimize these instructions heavily across a wide range of workloads.
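As a small illustration of that dual role (my own C11 example, not taken from the JVM sources): a sequentially consistent fetch-and-add typically compiles on x86-64 to a single LOCK XADD, which performs the atomic update and at the same time provides full-fence ordering, so no separate MFENCE is needed:

    #include <stdatomic.h>

    /* Atomically increment a counter. On x86-64, compilers typically emit
     * LOCK XADD here; the LOCK prefix makes the update atomic and also
     * gives the instruction full-fence (StoreLoad) ordering. */
    long bump_counter(_Atomic long *counter) {
        return atomic_fetch_add_explicit(counter, 1, memory_order_seq_cst);
    }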
Key Questions
1. What exactly does MFENCE guarantee that a LOCK-prefixed read-modify-write does not? In particular, does MFENCE carry "hidden semantics", for example concerning pipeline state or inter-core synchronization, that explain its higher cost?
2. Why do LOCK-prefixed instructions, which additionally have to enforce atomicity, end up faster than MFENCE on modern AMD and Intel processors?
I’d greatly appreciate detailed insights into these questions, particularly from those familiar with low-level architecture and microcode optimizations. Thank you in advance!