user2188453

How many cudaMemcpyAsync operations can be done concurrently?

Considering the following case:

// thread 0 on device 0:
cudaMemcpyAsync(Dst0, Src0, ..., stream0); // stream0 is on device 0

...
// thread 1 on device 1:
cudaMemcpyAsync(Dst1, Src1, ..., stream1); // stream1 is on device 1

Can the two memcpy operations run concurrently and achieve double the host-device bandwidth (as long as the host memory bandwidth is sufficient)? If the answer is yes, is there an upper limit to such concurrency?

I plan to write a program for many (6-8) GPUs in a single compute node, so this will be quite critical for performance.


Answers (1)

Robert Crovella

Only one cudaMemcpy operation can actually be transferring data at any given time, per direction, across a given PCIe link. Other operations can be queued up, of course, but only one can actually be using the link. Since each GPU typically sits behind its own PCIe link, transfers to different GPUs can be in flight concurrently.

Some CUDA GPUs have one DMA (copy) engine; others have two. Those that have two can be transferring data in both directions simultaneously. That is the limit of concurrency in data transfers per GPU: one per direction (assuming dual copy engines).

Assuming the data transfers are not trivially small, then for the duration of each transfer, the PCIe link will be fully utilized (in that direction).
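For the multi-GPU case in the question, the per-link limit means one stream per device is enough to keep every link busy. A minimal sketch, untested here (it needs the CUDA toolkit and actual GPUs; the buffer size is illustrative), showing the pattern, note that the host buffers must be pinned (page-locked) for cudaMemcpyAsync to overlap at all:

```cuda
#include <cuda_runtime.h>
#include <vector>

int main() {
    int ndev = 0;
    cudaGetDeviceCount(&ndev);

    const size_t nbytes = 64 << 20;            // 64 MiB per GPU, illustrative
    std::vector<cudaStream_t> streams(ndev);

    for (int d = 0; d < ndev; ++d) {
        cudaSetDevice(d);

        // Pinned host memory is required for a truly asynchronous copy;
        // with pageable memory the transfer is staged and may serialize.
        void *h_buf = nullptr, *d_buf = nullptr;
        cudaHostAlloc(&h_buf, nbytes, cudaHostAllocDefault);
        cudaMalloc(&d_buf, nbytes);

        cudaStreamCreate(&streams[d]);

        // Each GPU is behind its own PCIe link, so these H2D copies can
        // all be in flight at once: one transfer per link, per direction.
        cudaMemcpyAsync(d_buf, h_buf, nbytes,
                        cudaMemcpyHostToDevice, streams[d]);
    }

    // Wait for all copies; cudaDeviceSynchronize only applies to the
    // current device, so loop over devices.
    for (int d = 0; d < ndev; ++d) {
        cudaSetDevice(d);
        cudaDeviceSynchronize();
    }
    return 0;
}
```

Whether you actually see 6-8x aggregate bandwidth then depends on the host side: PCIe lanes shared through a switch, and total host memory bandwidth, can cap the sum below the per-link ideal.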

