Atomicity of small PCIe TLP writes

Question

Are there any guarantees about how card-to-host writes from a PCIe device targeting regular memory appear from a software process's perspective, when a single TLP write is fully contained within a single CPU cache line?

I'm wondering about a case where my device writes some number of words of data followed by a byte to indicate that the structure is now valid (an event completion, say), for example:

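// alignas must follow the struct keyword; placing it after the closing brace does not compile.
struct alignas(SYSTEM_CACHE_LINE_SIZE) PCIE_COMPLETION_T {
    uint64_t  data_a;
    uint64_t  data_b;
    uint64_t  data_c;
    uint64_t  data_d;
    uint8_t   valid;   // highest address: written last within the record
};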
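The layout assumptions behind this structure can be checked at compile time. The following is only a sketch: `kCacheLineSize` (assumed to be 64 bytes, as is typical on x86-64), the `Completion` name, and `kPayloadBytes` are hypothetical and not part of the original code. It verifies that the record occupies exactly one cache line and that `valid` sits at a higher address than every data member.

```cpp
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines, typical for x86-64 hosts.
constexpr std::size_t kCacheLineSize = 64;

// Hypothetical mirror of the completion record; valid is deliberately last.
struct alignas(kCacheLineSize) Completion {
    std::uint64_t data_a;
    std::uint64_t data_b;
    std::uint64_t data_c;
    std::uint64_t data_d;
    std::uint8_t  valid;
};

// The whole record must fit in, and be aligned to, one cache line.
static_assert(sizeof(Completion) == kCacheLineSize,
              "record must occupy exactly one cache line");
static_assert(alignof(Completion) == kCacheLineSize,
              "record must be cache-line aligned");

// The flag must live at a higher address than all of the data it guards.
static_assert(offsetof(Completion, valid) > offsetof(Completion, data_d),
              "valid must follow the data members");

// Bytes the device actually needs to write: everything up to and including valid.
constexpr std::size_t kPayloadBytes =
    offsetof(Completion, valid) + sizeof(std::uint8_t);
```

If a later change moves `valid` ahead of a data member or grows the record past one cache line, the build fails instead of silently changing what the device write covers.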

Can I use a single TLP to write this structure, such that when software sees the valid member change to 1 (having previously been cleared to zero by software), the other data members will also reflect the values that I wrote and not a previous value?

Currently I'm performing 2 writes, first writing the data and secondly marking it as valid, which doesn't have any apparent race conditions but does of course add unwanted overhead.

The most relevant question I can see on this site seems to be "Are writes on the PCIe bus atomic?", although that appears to relate to the relative ordering of TLPs.

Perusing the PCIe 3.0 specification, I didn't find anything that seemed to explicitly cover my concerns, and I don't think that I need AtomicOps in particular. Given that I'm only concerned about interactions with x86-64 systems, I also dug through the Intel architecture guide, but came away no clearer.

Instinctively it seems that it should be possible for such a write to be perceived atomically -- especially as it is said to be a transaction -- but equally I can't find much in the way of documentation explicitly confirming that view (nor am I quite sure what I'd need to look at, probably the CPU vendor?). I also wonder if such a scheme can be extended over multiple cache lines, i.e. if the valid flag sits on a second cache line written by the same TLP, can I be assured that the first will be perceived no later than the second?

prl · Accepted Answer

The write may be broken into smaller units, as small as dwords, but if it is, they must be observed in increasing address order.

PCIe revision 4, section 2.4.3:

If a single write transaction containing multiple DWs and the Relaxed Ordering bit Clear is accepted by a Completer, the observed ordering of the updates to locations within the Completer's data buffer must be in increasing address order. This semantic is required in case a PCI or PCI-X Bridge along the path combines multiple write transactions into the single one. However, the observed granularity of the updates to the Completer's data buffer is outside the scope of this specification.

While not required by this specification, it is strongly recommended that host platforms guarantee that when a PCI Express write updates host memory, the update granularity observed by a host CPU will not be smaller than a DW.

As an example of update ordering and granularity, if a Requester writes a QW to host memory, in some cases a host CPU reading that QW from host memory could observe the first DW updated and the second DW containing the old value.

I don't have a copy of revision 3, but I suspect this language is in that revision as well. To help you find it, Section 2.4 is "Transaction Ordering" and section 2.4.3 is "Update Ordering and Granularity Provided by a Write Transaction".
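Applied to the structure in the question, the increasing-address-order rule suggests that a flag placed at the highest address of the write can serve as the "valid" marker, provided the Relaxed Ordering bit is clear. Below is a minimal host-side sketch of that reading, not a definitive implementation. The `Completion` and `Payload` types and the `try_consume` function are hypothetical names, the 64-byte alignment is an assumption, and the `volatile` accesses plus an acquire fence are one conservative way to stop the compiler from reading the data before the flag. Real driver code would normally use the platform's own DMA read barriers.

```cpp
#include <atomic>
#include <cstdint>
#include <optional>

// Hypothetical completion record; valid is at the highest address.
struct alignas(64) Completion {
    std::uint64_t data_a;
    std::uint64_t data_b;
    std::uint64_t data_c;
    std::uint64_t data_d;
    std::uint8_t  valid;
};

// Plain copy of the data members handed back to the caller.
struct Payload {
    std::uint64_t a, b, c, d;
};

// Returns the payload once the device has set valid, then clears the flag
// so the slot can be reused. Returns std::nullopt if nothing has arrived.
inline std::optional<Payload> try_consume(volatile Completion* c) {
    // Read the flag first: it is the last location updated by the write.
    if (c->valid == 0) {
        return std::nullopt;
    }
    // Assumption: this fence keeps the data loads after the flag load.
    // On x86-64 it costs no instruction; it only constrains the compiler.
    std::atomic_thread_fence(std::memory_order_acquire);

    Payload p{c->data_a, c->data_b, c->data_c, c->data_d};

    // Software clears the flag, as described in the question.
    c->valid = 0;
    return p;
}
```

Note that the quoted text only recommends, and does not require, an update granularity of at least a DW, so each 64-bit member could in principle be seen half-updated before `valid` is set. Reading the flag first avoids relying on that granularity.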
