Context
This question explores the semantics of the MFENCE full memory barrier instruction in the x86 instruction set, particularly in comparison with LOCK-prefixed read-modify-write instructions.
Historically, the JVM used the MFENCE instruction to implement memory ordering guarantees, such as those required for the volatile keyword. However, a significant shift occurred in the late 2000s. In a blog post from May 2009, David Dice described a transition to LOCK-prefixed instructions, specifically a LOCK:ADD of zero to the top of the stack (TOS), citing superior performance.
Dice noted that on the AMD processors of the time, LOCK:ADD was nearly twice as fast as MFENCE, and on Intel processors it also interacted more favorably with the instruction pipeline. These observations motivated JVM developers to switch to LOCK-prefixed instructions for implementing memory fences, particularly for volatile accesses.
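For concreteness, here is a minimal sketch in C with GCC-style inline assembly (x86-64, AT&T syntax) contrasting the two barrier idioms. The function names are mine, and the exact stack operand HotSpot uses may differ slightly from what is shown:

    /* Full memory barrier via MFENCE (the older HotSpot approach). */
    static inline void fence_mfence(void) {
        __asm__ __volatile__("mfence" ::: "memory");
    }

    /* Full memory barrier via a LOCK-prefixed ADD of zero to the top of
     * the stack, the idiom Dice describes: the locked read-modify-write
     * drains the store buffer and orders earlier stores before later
     * loads, as MFENCE does, while leaving the stack word unchanged. */
    static inline void fence_lock_add(void) {
        __asm__ __volatile__("lock; addl $0, 0(%%rsp)" ::: "memory", "cc");
    }

Both functions act as full (StoreLoad) barriers on x86-64; the difference the question is about is purely one of cost.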
This shift is intriguing because a LOCK-prefixed read-modify-write instruction theoretically does more work than the purpose-built MFENCE, since it must also enforce atomicity. One might therefore expect it to be less efficient, not more. Dice speculated that MFENCE might carry additional, less-visible semantics, possibly related to pipeline state or inter-core synchronization, that account for its relatively lower performance. However, the exact nature of these potential "hidden semantics" remains unclear.
Moreover, LOCK-prefixed instructions offer added flexibility: a single instruction can serve as both a memory fence and an atomic operation, and modern CPUs optimize these instructions heavily across a wide range of workloads.
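As a small illustration of that dual role (my own C11 example, not taken from the JVM sources): a sequentially consistent fetch-and-add typically compiles on x86-64 to a single LOCK XADD, which performs the atomic update and at the same time provides full-fence ordering, so no separate MFENCE is needed:

    #include <stdatomic.h>

    /* Atomically increment a counter. On x86-64, compilers typically emit
     * LOCK XADD here; the LOCK prefix makes the update atomic and also
     * gives the instruction full-fence (StoreLoad) ordering. */
    long bump_counter(_Atomic long *counter) {
        return atomic_fetch_add_explicit(counter, 1, memory_order_seq_cst);
    }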
Key Questions
1. What exactly does MFENCE guarantee that a LOCK-prefixed read-modify-write does not? In particular, does MFENCE carry "hidden semantics", for example concerning pipeline state or inter-core synchronization, that explain its higher cost?
2. Why do LOCK-prefixed instructions, which additionally have to enforce atomicity, end up faster than MFENCE on modern AMD and Intel processors?
I’d greatly appreciate detailed insights into these questions, particularly from those familiar with low-level architecture and microcode optimizations. Thank you in advance!