Prefetch instruction behavior

Question

In order to satisfy some security property, I want to make sure that an important piece of data is already in the cache when a statement accesses it (so there will be no cache miss). For example, for this code:

...
a += 2;
...

I want to make sure that a is in the cache right before a += 2 is executed.

I was considering using the PREFETCHh instruction of x86 to achieve this:

...
__prefetch(&a);     /* pseudocode */
a += 2;
...

However, I have read that inserting the prefetch instruction right before a += 2 might be too late to ensure a is in the cache when a += 2 gets executed. Is this claim true? If it is true, can I fix it by inserting a CPUID instruction after the prefetch to ensure the prefetch instruction has been executed (because the Intel manual says PREFETCHh is ordered with respect to CPUID)?

Peter Cordes · Accepted Answer

Yes, you need to prefetch with a lead-time of about the memory latency for it to be optimal. Ulrich Drepper's What Every Programmer Should Know About Memory talks a lot about prefetching.

Making this happen will be highly non-trivial for a single access. Too soon and your data might be evicted before the insn you care about. Too late and the prefetch will only shorten the miss somewhat instead of hiding it. Tuning this will depend on compiler version/options, and on the hardware you're running on. (Higher instructions-per-cycle means you need to prefetch earlier. Higher memory latency also means you need to prefetch earlier.)
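To make the lead-time idea concrete, here is a minimal sketch of prefetching a fixed distance ahead in a loop. It assumes GCC or Clang (__builtin_prefetch is their builtin, not standard C). The name sum_with_prefetch and the PREFETCH_DISTANCE value are hypothetical and only illustrate the tuning knob. For a plain linear scan like this one the hardware prefetchers usually do the job already, so treat it as an illustration of lead time, not as a speedup.

```c
#include <stddef.h>

/* Hypothetical tuning knob: how many elements ahead to prefetch.
   The right value depends on memory latency and on the work per iteration. */
#define PREFETCH_DISTANCE 16

long sum_with_prefetch(const int *data, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n) {
            /* rw = 0 (read), locality = 3 (keep in all cache levels).
               This is a hint; the CPU is free to ignore it. */
            __builtin_prefetch(&data[i + PREFETCH_DISTANCE], 0, 3);
        }
        total += data[i];
    }
    return total;
}
```

A larger distance protects against higher latency but raises the chance that the line is evicted again before it is used.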

Since you want to do a read-modify-write to a, you should use PREFETCHW if available. The other prefetch instructions only prefetch for reading, so the read part of the RMW could hit, but I think the store part could still be delayed while MESI-style cache coherency gets write-ownership of the cache line.
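From C, a write-intent prefetch can be requested roughly like this. This is a sketch assuming GCC or Clang, and add_two_with_write_hint is a made-up name. Whether the compiler actually emits PREFETCHW depends on the target options (for example GCC's -mprfchw), so check the generated asm.

```c
/* Assumes GCC or Clang: __builtin_prefetch is a compiler builtin. */
void add_two_with_write_hint(int *p)
{
    /* rw = 1 asks for a prefetch with intent to write; locality = 3. */
    __builtin_prefetch(p, 1, 3);

    /* Ideally there is enough independent work here to cover the
       memory latency; placed back to back, the hint arrives too late. */

    *p += 2;
}
```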

If a isn't atomic, you can also just load a well ahead of time and use the copy in a register. The store back to the global could easily miss in this case, which could eventually stall execution, though.
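A rough sketch of that early-load idea, assuming nothing else modifies the variable in between. The function name and the summing loop (a stand-in for unrelated work) are hypothetical. The volatile cast is there so the compiler emits the load at that point, but the placement still has to be confirmed in the compiler output.

```c
/* Loads *p early, does other work, then stores the updated copy.
   Only valid if no other thread or signal handler writes *p meanwhile. */
int add_two_early_load(int *p, const int *other, int n)
{
    int copy = *(volatile int *)p;   /* early load; a miss overlaps the work below */

    int acc = 0;
    for (int i = 0; i < n; i++)      /* stand-in for unrelated work */
        acc += other[i];

    *p = copy + 2;                   /* this store can still miss in cache */
    return acc;
}
```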

You'll probably have a hard time doing something like that reliably with a compiler, instead of writing asm yourself. Any of the other ideas will also require checking the compiler output to make sure the compiler did what you're hoping.

Prefetch instructions don't necessarily prefetch anything. They're "hints", which presumably get ignored when the number of outstanding loads is near max (i.e. almost out of load buffers).

Another option is to load it (not just prefetch) and then serialize with a CPUID. (A load that throws away the result is like a prefetch). The load would have to complete before the serializing instruction, and instructions after the serializing insn can't start decoding until then. I think a prefetch can retire before the data arrives, which is normally an advantage, but not in this case where we care about one operation hitting at the expense of overall performance.

From Intel's insn ref manual (see the x86 tag wiki) entry for CPUID:

Serializing instruction execution guarantees that any modifications to flags, registers, and memory for previous instructions are completed before the next instruction is fetched and executed.

I think a sequence like this is fairly good (but still doesn't guarantee anything in a pre-emptive multi-tasking system):

add [mem], 0        # can't retire until the store completes, requiring that our core owns the cache line for writing
CPUID               # later insns can't start until the prev add retires
add [mem], 2        # a += 2   Can't miss in cache unless an interrupt or the other hyper-thread evicts the cache line before this insn can execute

Here we're using add [mem], 0 as a write-prefetch which is otherwise a near no-op. (It is a non-atomic read-modify-write.) I'm not sure if PREFETCHW really will ensure the cache line is ready if you do PREFETCHW / CPUID / add [mem], 2. The insn is ordered wrt. CPUID, but the manual doesn't say that the prefetch effect is ordered.
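If you would rather keep this in a C file, the same add / CPUID / add sequence could be wrapped in GNU inline asm along these lines. This is a sketch assuming GCC or Clang on x86-64, with hypothetical function names. The fallback branch only exists so the code compiles on other targets; it does not serialize anything.

```c
/* Assumes GCC/Clang extended asm (AT&T syntax). */
static inline void serialize_cpuid(void)
{
#if defined(__x86_64__)
    unsigned int eax = 0, ebx, ecx = 0, edx;
    __asm__ volatile("cpuid"
                     : "+a"(eax), "=b"(ebx), "+c"(ecx), "=d"(edx)
                     :
                     : "memory");
    (void)ebx;
    (void)edx;
#else
    __asm__ volatile("" ::: "memory");   /* compiler barrier only */
#endif
}

void add_two_after_write_touch(int *p)
{
#if defined(__x86_64__)
    /* Write-touch: needs the line in a writable state before it can complete. */
    __asm__ volatile("addl $0, %0" : "+m"(*p) : : "cc");
    serialize_cpuid();                   /* later insns wait for the add above */
    __asm__ volatile("addl $2, %0" : "+m"(*p) : : "cc");   /* a += 2 */
#else
    *(volatile int *)p += 0;             /* portable stand-in, no ordering guarantee */
    serialize_cpuid();
    *p += 2;
#endif
}
```

As with the plain asm version, an interrupt or the other hyper-thread can still evict the line between the two adds.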

If a is volatile, then (void)a; will get gcc or clang to emit a load insn. I assume most other compilers (MSVC?) are the same. You can probably do (void) *(volatile something*)&a to dereference a pointer to volatile and force a load from a's address.
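For example, a small helper along these lines forces a load from an ordinary int. The names are hypothetical, and the behaviour described is what GCC and Clang do when compiling C.

```c
/* Forces a real load from *p and throws the value away. */
static inline void touch_for_read(const int *p)
{
    (void)*(const volatile int *)p;
}

void add_two_after_touch(int *p)
{
    touch_for_read(p);   /* acts like a read prefetch, but it is a real load */
    *p += 2;
}
```

Without a serializing instruction between the two, nothing makes the later access wait for the load to complete.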

To guarantee that a memory access will hit in cache, you'd need to be running at realtime priority pinned to a core that doesn't receive interrupts. Depending on the OS, the timer-interrupt handler is probably lightweight enough that the chance of it evicting your data from cache is low.
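On a POSIX system, requesting realtime scheduling might look like the sketch below. request_realtime_fifo is a hypothetical name, the call normally needs elevated privileges, and pinning to a core is left out because it needs OS-specific calls. Steering interrupts away from that core is a system configuration matter, not something this code does.

```c
#include <sched.h>

/* Tries to switch the calling process to SCHED_FIFO at the lowest
   realtime priority. Returns 0 on success, -1 on failure (errno is
   set by the failing call, typically EPERM without privileges). */
int request_realtime_fifo(void)
{
    struct sched_param sp;

    sp.sched_priority = sched_get_priority_min(SCHED_FIFO);
    if (sp.sched_priority == -1)
        return -1;

    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0)
        return -1;

    return 0;
}
```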

If your process is descheduled between executing a prefetch insn and doing the real access, the data will probably have been evicted from at least L1 cache.

So it's unlikely you can defeat an attacker determined to do a timing attack on your code, unless it's realistic to run at realtime priority. An attacker could run many many threads of memory-intensive code...
