Reputation: 2807
In Linux KCOV code, why is this barrier() placed?
void notrace __sanitizer_cov_trace_pc(void)
{
    struct task_struct *t;
    enum kcov_mode mode;

    t = current;
    /*
     * We are interested in code coverage as a function of a syscall inputs,
     * so we ignore code executed in interrupts.
     */
    if (!t || in_interrupt())
        return;
    mode = READ_ONCE(t->kcov_mode);
    if (mode == KCOV_MODE_TRACE) {
        unsigned long *area;
        unsigned long pos;

        /*
         * There is some code that runs in interrupts but for which
         * in_interrupt() returns false (e.g. preempt_schedule_irq()).
         * READ_ONCE()/barrier() effectively provides load-acquire wrt
         * interrupts, there are paired barrier()/WRITE_ONCE() in
         * kcov_ioctl_locked().
         */
        barrier();
        area = t->kcov_area;
        /* The first word is number of subsequent PCs. */
        pos = READ_ONCE(area[0]) + 1;
        if (likely(pos < t->kcov_size)) {
            area[pos] = _RET_IP_;
            WRITE_ONCE(area[0], pos);
        }
    }
}
A barrier() call prevents the compiler from re-ordering instructions. However, how is that related to interrupts here? Why is it needed for semantic correctness?
Upvotes: 2
Views: 297
Reputation: 365537
Without barrier(), the compiler would be free to access t->kcov_area before t->kcov_mode. It's unlikely to want to do that in practice, but that's not the point. Without some kind of barrier, the C rules allow the compiler to create asm that doesn't do what we want. (The C11 memory model has no ordering guarantees beyond what you impose explicitly, whether in C11 via stdatomic or in Linux / GNU C via barriers like barrier() or smp_rmb().)
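For reference, Linux's barrier() is nothing more than an empty asm statement with a "memory" clobber: it emits no instructions, but forbids the compiler from reordering or caching memory accesses across it. (This is the definition from the kernel's compiler headers; the exact file it lives in varies between kernel versions.)

/* include/linux/compiler*.h (exact location varies by kernel version) */
#define barrier() __asm__ __volatile__("" : : : "memory")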
As described in the comment, barrier() is creating an acquire-load wrt. code running on the same core, which is all you need for interrupts.
mode = READ_ONCE(t->kcov_mode);
if (mode == KCOV_MODE_TRACE) {
    ...
    barrier();
    area = t->kcov_area;
    ...
I'm not familiar with kcov in general, but it looks like seeing a certain value in t->kcov_mode with an acquire load makes it safe to read t->kcov_area. (Because whatever code writes that object writes kcov_area first, then does a release-store to kcov_mode.)
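For context, here is a simplified sketch of the paired store side that the kcov comment refers to (the real code in kcov_ioctl_locked() differs in detail across kernel versions): the area and size are published first, then comes the compiler barrier, then the mode, so an interrupt that observes KCOV_MODE_TRACE is guaranteed to also see a valid kcov_area.

/* Sketch of the KCOV_ENABLE path in kcov_ioctl_locked() (simplified) */
t->kcov_size = kcov->size;
t->kcov_area = kcov->area;            /* publish the buffer first ...         */
barrier();                            /* keep the next store after the above  */
WRITE_ONCE(t->kcov_mode, kcov->mode); /* ... then "release" it via the mode   */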
https://preshing.com/20120913/acquire-and-release-semantics/ explains acq / rel synchronization in general.
Why isn't smp_rmb() required? (Even on weakly-ordered ISAs, where acquire ordering would need a fence instruction to guarantee seeing other stores done by another core.)
An interrupt handler runs on the same core that was doing the other operations, just like a signal handler interrupts a thread and runs in its context. struct task_struct *t = current means that the data we're looking at is local to a single task. This is equivalent to something within a single thread in user-space. (Kernel pre-emption leading to re-scheduling on a different core will use whatever memory barriers are necessary to preserve correct execution of a single thread when that other core accesses the memory this task had been using.)
The user-space C11 stdatomic equivalent of this barrier is atomic_signal_fence(memory_order_acquire). Signal fences only have to block compile-time reordering (like Linux barrier()), unlike atomic_thread_fence that has to emit a memory-barrier asm instruction.
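To make that concrete, here is a hypothetical user-space analogue (my own sketch, not kcov code): a signal handler stands in for the interrupt, plain globals stand in for kcov_area / kcov_mode, and atomic_signal_fence() stands in for barrier().

#include <signal.h>
#include <stdatomic.h>

static unsigned long buf[64];        /* plays the role of kcov_area */
static volatile sig_atomic_t ready;  /* plays the role of kcov_mode */

static void handler(int sig)         /* the "interrupt": runs on the same thread */
{
    (void)sig;
    if (ready) {
        /* Compile-time-only acquire: pairs with the release fence in main(). */
        atomic_signal_fence(memory_order_acquire);
        buf[0]++;                    /* safe: buf was fully set up before 'ready' */
    }
}

int main(void)
{
    signal(SIGUSR1, handler);
    buf[0] = 0;                                 /* set the buffer up first ...  */
    atomic_signal_fence(memory_order_release);  /* compile-time barrier only    */
    ready = 1;                                  /* ... then publish it          */
    raise(SIGUSR1);
    return (int)buf[0];                         /* 1 if the handler saw 'ready' */
}

Neither fence compiles to an instruction; both only constrain the compiler, which is enough here because the handler runs on the same core, in program order relative to main().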
Out-of-order CPUs do reorder things internally, but the cardinal rule of OoO exec is to preserve the illusion of instructions running one at a time, in order, for the core running the instructions. This is why you don't need a memory barrier for the asm equivalent of a = 1; b = a; to correctly load the 1 you just stored; hardware preserves the illusion of serial execution¹ in program order. (Typically via having loads snoop the store buffer and store-forward from stores to loads for stores that haven't committed to L1d cache yet.)
Instructions in an interrupt handler logically run after the point where the interrupt happened (as per the interrupt-return address). Therefore we just need the asm instructions in the right order (barrier()), and hardware will make everything work.
Footnote 1: There are some explicitly-parallel ISAs like IA-64 and the Mill, but they provide rules that asm can follow to be sure that one instruction sees the effect of another earlier one. Same for classic MIPS I load delay slots and stuff like that. Compilers take care of this for compiled C.
Upvotes: 2