Reputation: 15229
I am currently writing an Intel 8042 driver and have written two loops to wait until some buffers are ready for use:
    /* Waits until the Intel 8042's input buffer is empty, i.e., until the
     * controller has processed the input. */
    i8042_waitin:
        pause
        in      $i8042_STATUS, %al
        and     $i8042_STAT_INEMPTY, %al
        jz      i8042_waitin
        ret

    /* Waits until the Intel 8042's output buffer is full, i.e., data to read
     * is available.
     * ATTENTION: this is the polling variant, but there is also a way with
     * interrupts! By setting bit 0 in the command byte you can enable an
     * interrupt to be fired when the output buffer is full. */
    i8042_waitout:
        pause
        in      $i8042_STATUS, %al
        and     $i8042_STAT_OUTFULL, %al
        jz      i8042_waitout
        ret
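For context, the interrupt-driven alternative mentioned in the comment could look roughly like the following. This is a rough, untested sketch: i8042_COMMAND and i8042_DATA are assumed constants for the conventional ports 0x64 and 0x60, and 0x20/0x60 are the standard "read command byte"/"write command byte" controller commands.

    /* Sketch: enable the interrupt on "output buffer full" (bit 0 of the
     * controller's command byte) instead of polling. */
    i8042_enable_outint:
        call    i8042_waitin        # wait until the controller accepts a command
        mov     $0x20, %al          # 0x20: read the command byte
        out     %al, $i8042_COMMAND
        call    i8042_waitout       # wait until the command byte is available
        in      $i8042_DATA, %al
        or      $0x01, %al          # set bit 0: interrupt on output buffer full
        mov     %al, %ah            # stash it; the wait routines clobber %al
        call    i8042_waitin
        mov     $0x60, %al          # 0x60: write the command byte
        out     %al, $i8042_COMMAND
        call    i8042_waitin
        mov     %ah, %al
        out     %al, $i8042_DATA
        ret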
As you can see, I inserted pause instructions in the loops. I've only recently learned about the instruction and naturally wanted to try it out.
Since the content of %al is unpredictable (it comes from an I/O read), the branch predictor will fill the pipeline with instructions from the loop: after a few iterations it will notice that the branch is almost always taken, similarly to the case here.
That reasoning is only correct if the branch predictor actually takes the results of I/O instructions into account, which I am not sure of.
So does the branch predictor adjust itself using the results of I/O instructions, as it does with unpredictable memory reads? Or is something else going on here?
Does pause make sense here?
Upvotes: 1
Views: 273
Reputation: 39621
The branch predictor doesn't include any other instructions in its predictions. It makes its guesses based on the branch instruction itself and/or its previous history of branches. None of the other instructions in the loop (PAUSE, IN and AND) have any effect on branch prediction.
The PAUSE instruction suggested in the answer you linked isn't meant to affect the branch predictor. It's meant to avoid the pipeline stall that happens when the memory location read by the CMP instruction in that question's example code is written to by another processor. The CMP instruction doesn't affect branch prediction either.
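For reference, the pattern from that answer looks roughly like this in your syntax (lock_var is a hypothetical flag that another processor eventually clears; this is a sketch, not tested code):

    /* Spin until another processor clears lock_var. */
    spin_wait:
        pause                       # hint to the CPU: this is a spin-wait loop
        cmpl    $0, lock_var        # re-read the shared memory location
        jne     spin_wait           # keep spinning while the flag is nonzero
        ret

The PAUSE here tells the CPU not to speculate many iterations ahead on that memory read, which is exactly the situation your IN-based loops don't have.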
Peter Cordes mentions that you might be confused by the different techniques the CPU uses to speculatively execute instructions in order to keep its pipelines full. In the question you linked there were two different ways speculative execution ended up hurting the performance of the spin lock. Both have a common root: the CPU tries to execute the loop as fast as possible, but what actually matters for the performance of a spin lock is how fast it gets out of the loop. Only the speed of the final iteration matters.
The first part of the problem is that the branch predictor quickly learns that the branch is always taken. On the final iteration of the loop there will be a stall, because the CPU will have gone on to speculatively execute another iteration; it has to throw that work away and then start executing the code after the loop.
It turns out it's even worse than that, because the CPU will also speculatively perform the memory read used by the CMP instruction. Since it accesses normal memory, speculative reads are harmless and have no side effects. (This is unlike your IN instruction, as I/O reads from devices can have side effects.) This allows the CPU to speculatively execute multiple iterations of the loop at once. When another CPU changes the memory location, this invalidates every instruction in the pipeline that depends on those speculative reads, so the CPU executing the spin lock ends up stalling while it flushes them from the pipeline.
In your code I don't think the PAUSE instruction will improve the performance of the loops. The IN instruction doesn't access normal memory, so it can't cause the pipeline to be flushed because of a write by another CPU. And since the IN instruction can't be executed speculatively, there can only be one IN instruction in the pipeline at a time, so the cost of the mispredicted branch at the end of the loop is relatively small. PAUSE may still have the other benefits mentioned in that answer: reducing power usage and making more execution resources available to the other logical CPU on hyperthreading processors.
Not that it really matters. It takes the keyboard controller over a million cycles on a modern processor to send or receive a single byte, so even a few hundred extra cycles on top of that from some worst-case pipeline stall are insignificant.
Upvotes: 4