Reputation: 673
On GPUs, transactions to the L2 cache can be 32B, 64B or 128B in size (for both reads and writes). The total number of such transactions can be measured using nvprof metrics like gst_transactions and gld_transactions. However, I am unable to find any material that details how these transactions are mapped to DRAM accesses, i.e., how they are handled by the DRAM, which usually has a different bus width. For example, the TitanXp GPU has a 384-bit global memory bus and the P100 has a 3072-bit memory bus. So how are the 32B, 64B or 128B transactions mapped to these memory buses? And how can I measure the number of transactions generated by the DRAM controller?
PS: The dram_read_transactions metric does not seem to do this. I say that because I get the same value for dram_read_transactions on the TitanXp and the P100 (even for sequential access) in spite of the two having widely different bus widths.
Upvotes: 4
Views: 473
Reputation: 151869
Although GPU DRAM may have different (hardware) bus widths across different GPU types, the bus is always composed of a set of partitions, each of which has an effective width of 32 bytes. A DRAM transaction from the profiler perspective actually consists of one of these 32-byte transactions, not a transaction at full "bus width".
Therefore a (single) 32 byte transaction to L2, if it misses in the L2, will convert to a single 32-byte DRAM transaction. Transactions of higher granularity, such as 64-byte or 128-byte, will convert into the requisite number of 32-byte DRAM transactions. This is discoverable using any of the CUDA profilers.
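As a rough illustration of the conversion described above, here is a minimal sketch of the arithmetic. It assumes the whole L2 transaction misses in the L2 and that the profiler counts DRAM transactions in 32-byte units; the function name is made up for this example and is not part of any CUDA tool.

```python
# Size of one DRAM transaction as counted by the profiler (per the answer above).
DRAM_TRANSACTION_BYTES = 32


def dram_transactions_for_l2_miss(l2_transaction_bytes: int) -> int:
    """Return how many 32-byte DRAM transactions one fully missing
    L2 transaction of the given size would turn into.

    Hypothetical helper for illustration only; assumes every byte of
    the L2 transaction misses in the L2.
    """
    if l2_transaction_bytes <= 0:
        raise ValueError("transaction size must be positive")
    # Ceiling division: a partial 32-byte chunk still costs a full transaction.
    return -(-l2_transaction_bytes // DRAM_TRANSACTION_BYTES)


if __name__ == "__main__":
    for size in (32, 64, 128):
        print(f"{size}B L2 miss -> {dram_transactions_for_l2_miss(size)} DRAM transaction(s)")
```

Under these assumptions the count depends only on the number of bytes that miss, not on the physical bus width, which is consistent with seeing the same dram_read_transactions value on two GPUs with different bus widths.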
These related questions here and here may be of interest as well.
Note that an "effective width" of 32 bytes, as used above, does not necessarily mean that a transaction requires a 32 bytes * 8 bits/byte = 256-bit-wide interface. DRAM buses can be "double-pumped" or "quad-pumped", which means a transaction may consist of multiple bits transferred per "wire" of the interface. Therefore you will find GPUs that have only a 128-bit-wide (or even 64-bit-wide) interface to GPU DRAM, but a "transaction" on these buses will still consist of 32 bytes, which requires multiple bits to be transferred (probably over multiple DRAM bus clock cycles) per "wire" of the interface.
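The per-wire arithmetic can be sketched as follows. This is only an illustration under simplifying assumptions: the whole 32-byte transaction travels over a single interface of the given width, command/address traffic is ignored, and the function name is hypothetical.

```python
def bits_per_wire(interface_width_bits: int, transaction_bytes: int = 32) -> int:
    """Return how many data bits each wire must carry to move one
    transaction over an interface of the given width.

    Hypothetical helper; assumes the transaction size in bits is an
    exact multiple of the interface width.
    """
    if interface_width_bits <= 0:
        raise ValueError("interface width must be positive")
    total_bits = transaction_bytes * 8  # 32 bytes -> 256 bits
    if total_bits % interface_width_bits != 0:
        raise ValueError("transaction does not divide evenly across the interface")
    return total_bits // interface_width_bits


if __name__ == "__main__":
    for width in (256, 128, 64):
        print(f"{width}-bit interface: {bits_per_wire(width)} bit(s) per wire")
```

How many bus clock cycles those bits take depends on how many bits the interface transfers per wire per clock (the "pumping" mentioned above), which this sketch deliberately leaves out.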
Upvotes: 5