Reputation: 675
Since a PCIe memory write is a posted TLP, what will happen when the CPU writes to a memory-mapped BAR address very frequently?
For example, writing a busy loop that updates a register on a PCIe device.
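Something like this minimal C sketch (assuming the BAR has already been mapped and mmio_base points into an uncached mapping of it; the register offset is made up):

    #include <stdint.h>

    /* Assumption: mmio_base points into an uncached mapping of the BAR
     * (e.g. obtained by mmap'ing the device's sysfs resource file);
     * REG_OFFSET is a hypothetical register offset. */
    #define REG_OFFSET 0x10u

    static void hammer_register(volatile uint8_t *mmio_base)
    {
        volatile uint32_t *reg = (volatile uint32_t *)(mmio_base + REG_OFFSET);
        for (uint32_t i = 0; ; i++)
            *reg = i;   /* each store should become a posted Memory Write TLP */
    }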
When the number of outstanding write TLPs reaches its maximum limit, what will happen? Will the TLP be dropped? Or will the CPU stop and wait? If it waits, since it's a posted TLP, how does the CPU know when the write is finished?
And will the behavior be different on different architectures, like x86-64 vs. ARM?
Upvotes: 0
Views: 718
Reputation: 1
A write TLP can only be initiated on PCIe if flow control credit is available. When you write, a credit is consumed and later restored by the PCIe device.
With a tight CPU loop of PCIe writes, you will write as fast as the CPU can for as long as credit is available. Once the credit is used up, you will only write as fast as the device can free up credit.
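Conceptually (just a toy model, not real driver code; the names and fields are made up), the credit gating behaves like this:

    #include <stdbool.h>
    #include <stdint.h>

    /* Toy model of posted-write flow control: a TLP may only be sent while
     * the receiver has advertised credits; the receiver returns credits
     * (via UpdateFC DLLPs in real hardware) as it drains its buffers. */
    struct fc_link {
        uint32_t posted_credits;     /* credits currently available */
    };

    static bool try_send_posted_write(struct fc_link *link)
    {
        if (link->posted_credits == 0)
            return false;            /* transmitter must wait -> CPU/RC stalls */
        link->posted_credits--;      /* the TLP consumes a credit */
        return true;
    }

    static void receiver_returns_credit(struct fc_link *link)
    {
        link->posted_credits++;      /* device freed a buffer, credit restored */
    }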
Upvotes: 0
Reputation: 44126
TL;DR: The CPU will eventually be stalled. Either the Root Complex (RC) stalls the CPU until the posted write cycle is completed, or, if it has an internal write queue, it stalls the CPU only when that queue is full. Note that even though posted writes don't require a Completion TLP, the Data Link Layer can still tell when a TLP has finished being transmitted. After all, PCIe timings are finite: every transaction takes at most a maximum number of cycles to complete. Posted writes consist of a single TLP, while non-posted writes (and reads) consist of two (the second one being the completion). Thus the RC considers a posted write finished when it has sent the TLP down to one of its Root Ports. At that point, it is ready to process the next CPU request. This is generally true for any switch in between and for the devices themselves.
Dropping a TLP for a posted write would be considered a hardware error, potentially reported through the Machine Check Architecture (MCA) extensions.
I don't know if ARM is different, but the PCIe part should be the same, and the CPU-to-Root-Complex interface should also be similar, since posted writes are better suited for writes to MMIO regardless of the architecture.
PCI(e) has well-defined timing for using the bus. Once the posted write cycle is complete, the write is considered finished; there is no feedback from the device. Before an agent can transmit, it has to "acquire" the bus, sensing whether it's idle. Once transmission has started, there is a maximum number of cycles allowed to complete it. So any agent wanting to transmit a TLP knows that it may have to wait before doing so, but only for a bounded time. The CPU talks to the Root Complex to generate TLPs; the RC can either stall the CPU until it has transmitted the posted write (or received the completion, for a non-posted write), or it could have a write queue and stall the CPU only when that queue is full (I've never tested which is true, if any).
Once the posted write cycle is finished (according to the PCIe specs), the write is considered successful. Writes to MMIO usually only update the device's internal state (held in registers), so they complete very fast. The device logic is triggered by updates to these registers (and not by the writes directly, usually), so any action taken is somewhat asynchronous with respect to the PCIe transactions.
The software usually polls a status register until the device logic updates a bit indicating whether a command is still in progress. Interrupts can also be used to avoid polling, but one way or another there is a synchronization mechanism orthogonal to the PCI(e) protocol.
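For example, a rough sketch of such a polling loop (the register offsets and bit meanings are hypothetical; a real driver would also use proper I/O accessors, barriers, and a timeout):

    #include <stdint.h>

    #define CMD_REG      0x00u           /* hypothetical command register offset */
    #define STATUS_REG   0x04u           /* hypothetical status register offset */
    #define STATUS_BUSY  (1u << 0)       /* hypothetical "command in progress" bit */

    static void issue_command(volatile uint8_t *mmio, uint32_t cmd)
    {
        volatile uint32_t *cmd_reg    = (volatile uint32_t *)(mmio + CMD_REG);
        volatile uint32_t *status_reg = (volatile uint32_t *)(mmio + STATUS_REG);

        *cmd_reg = cmd;          /* posted write: no feedback from the PCIe fabric */

        /* Synchronization happens at the device-protocol level, not at the
         * PCIe level: poll until the device clears its busy bit. */
        while (*status_reg & STATUS_BUSY)
            ;                    /* real code would also bail out on a timeout */
    }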
After all, stores don't have a "return value", so the CPU has no way to report a failed write.
A lot of devices have clear indications (in their datasheets) of how to properly send commands, including when it is OK to do so. Failing to follow the datasheet usually results in undefined behavior, like restarting the action midway or updating the internal registers without acting upon the new values.
Only if the posted write cannot reach the device at all, perhaps because the device doesn't exist or a bridge is misconfigured, does the write fail (and even then the CPU gets no feedback through the store itself; in some cases, I think, the MCA machinery may kick in).
A case that requires the software to use non-posted writes to MMIO (IO writes are always non-posted) is that of devices that perform individual units of work, typically accelerators (including GPUs). These devices must use queues to accept new work descriptors.
Intel introduced the enqcmd(s) instructions, which return in ZF whether the device accepted the write or rejected it for a later retry. This kind of non-posted write is called a Deferrable Memory Write. I cannot find a free copy of the specs, but I guess they differ from ordinary non-posted writes only in the values used in the completions.
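A rough sketch of how that looks from C, using the _enqcmd intrinsic (requires a CPU and compiler with ENQCMD support, e.g. GCC/Clang with -menqcmd; the portal and descriptor layout are device-specific and only placeholders here):

    #include <immintrin.h>   /* _enqcmd, needs -menqcmd */
    #include <stdint.h>

    /* desc: a 64-byte work descriptor in whatever format the device defines.
     * portal: the device's MMIO work-submission register. Both are
     * placeholders for illustration. */
    static int submit_work(void *portal, const void *desc)
    {
        /* _enqcmd issues a 64-byte Deferrable Memory Write and returns ZF:
         * 0 if the device accepted the descriptor, non-zero if it was
         * rejected and should be retried later. */
        return _enqcmd(portal, desc);
    }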
Upvotes: 2