mdashti

Reputation: 131

DeviceToDevice bandwidth formula in CUDA sample code bandwidthTest

The formula in the sample code provided with the SDK is the following (for DtoD transfer):

bandwidthInMBs = 2.0f * ((float)(1<<10) * memSize * (float)MEMCOPY_ITERATIONS) / (elapsedTimeInMs * (float)(1 << 20));
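For reference, here is a minimal C sketch of the same arithmetic with the units spelled out. The helper name `sample_bandwidth_mbs` and its parameters are hypothetical (they are not part of the SDK sample). `direction_factor` plays the role of the leading `2.0f` and would be `1.0f` for HtoD and DtoH. Because `(1 << 10) / (1 << 20)` is 1/1024, the expression amounts to total bytes copied divided by (elapsed milliseconds × 1024).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper mirroring the bandwidthTest expression.
 * direction_factor: 2.0f for device-to-device, 1.0f for HtoD/DtoH.
 * mem_size_bytes:   size of one copy in bytes (memSize in the sample).
 * iterations:       number of timed copies (MEMCOPY_ITERATIONS).
 * elapsed_ms:       total elapsed time for all iterations. */
static float sample_bandwidth_mbs(float direction_factor,
                                  size_t mem_size_bytes,
                                  unsigned iterations,
                                  float elapsed_ms)
{
    assert(elapsed_ms > 0.0f);
    /* Same constants as the sample: (1 << 10) / (1 << 20) == 1 / 1024. */
    return direction_factor *
           ((float)(1 << 10) * (float)mem_size_bytes * (float)iterations) /
           (elapsed_ms * (float)(1 << 20));
}
```

For example, ten copies of a 32 MiB buffer taking 64 ms in total give 5120 with `direction_factor = 1.0f` and 10240 with `2.0f`: the multiplier alone doubles the reported figure, just as in the Jetson TK1 numbers below.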

The 2.0f multiplier in the beginning does not exist for the DtoH and HtoD cases. Why? Is this because for the DtoD case, two copying operations are performed, so twice the memSize is actually transferred?

Also, how accurate is this formula on a physically unified system such as the Jetson TK1? Is the 2.0f multiplier necessary?

For example, on the Jetson TK1 I'm getting the following numbers:

DtoH = 6.1 GB/s

HtoD = 6.1 GB/s

DtoD = 12.2 GB/s (just because of the multiplier!)

Upvotes: 0

Views: 391

Answers (1)

talonmies

Reputation: 72349

[ Summarizing comments into an answer in the hope of getting the question off the unanswered list for the CUDA tag, where it has been languishing for more than four years ]

The 2.0f multiplier in the beginning does not exist for the DtoH and HtoD cases. Why?

Because (in a conventional system) a device-to-host or host-to-device operation involves only reads from, or only writes to, device memory. A device-to-device operation involves both a read from and a write to device memory, so twice as many device memory transactions are needed per byte transferred, and double the memory bandwidth is consumed.
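The counting argument can be written down as a small C sketch. The `copy_kind` enum and `device_traffic_bytes` function are hypothetical names used only for illustration. They count bytes of device-memory traffic for a copy of `n` bytes on a conventional discrete-GPU system and deliberately ignore host-memory traffic.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names mirroring the three directions bandwidthTest measures. */
typedef enum { COPY_HTOD, COPY_DTOH, COPY_DTOD } copy_kind;

/* Bytes of device (GPU) memory traffic caused by copying n bytes
 * on a discrete-GPU system. Host-memory traffic is not counted. */
static uint64_t device_traffic_bytes(copy_kind kind, uint64_t n)
{
    switch (kind) {
    case COPY_HTOD: return n;      /* only writes into device memory */
    case COPY_DTOH: return n;      /* only reads from device memory */
    case COPY_DTOD: return 2 * n;  /* a read from, plus a write to, device memory */
    }
    assert(!"unknown copy kind");
    return 0;
}
```

Dividing `device_traffic_bytes` by the elapsed time is what the `2.0f` in the DtoD formula accounts for.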

Is this because for the DtoD case, two copying operations are performed, so twice the memSize is actually transferred?

More or less, yes.

Also, how accurate is this formula on a physically unified system such as the Jetson TK1?

Nothing changes. A device-to-device transfer still involves two memory transactions per byte transferred, so twice the bandwidth is consumed.

Is the 2.0f multiplier necessary?

Yes. You could also argue that the 2x multiplier is required for host-to-device and device-to-host transfers on systems where host and device share the same physical memory, because those transfers are, in essence, identical operations to device-to-device transfers and consume twice the memory bandwidth per byte transferred.
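Under that argument, the Jetson TK1 numbers from the question can be reinterpreted as DRAM traffic. The following C sketch is a hypothetical illustration, not anything in bandwidthTest: `dram_traffic_gbs` multiplies a payload copy rate by 2 for every direction when host and device share DRAM (`shared_dram = true`), and only for DtoD otherwise. The 6.1 GB/s payload rate comes from the question; the reported 12.2 GB/s DtoD figure already includes the factor of 2.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names, for illustration only. */
typedef enum { COPY_HTOD, COPY_DTOH, COPY_DTOD } copy_kind;

/* Multiplier converting a payload copy rate into memory traffic.
 * shared_dram == true models a physically unified system such as the
 * Jetson TK1, where host and device memory are the same DRAM. */
static double dram_traffic_factor(copy_kind kind, bool shared_dram)
{
    if (shared_dram)
        return 2.0;  /* every copy reads and writes the same DRAM */
    /* Discrete GPU: only DtoD touches device DRAM twice per byte. */
    return kind == COPY_DTOD ? 2.0 : 1.0;
}

/* Convert a measured payload copy rate (GB/s) into memory traffic (GB/s). */
static double dram_traffic_gbs(copy_kind kind, bool shared_dram,
                               double copy_rate_gbs)
{
    assert(copy_rate_gbs >= 0.0);
    return dram_traffic_factor(kind, shared_dram) * copy_rate_gbs;
}
```

With `shared_dram = true`, a 6.1 GB/s HtoD copy corresponds to 12.2 GB/s of DRAM traffic, the same as the DtoD figure, which is consistent with all three directions being limited by the same physical memory.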

Upvotes: 1
