Doug

Reputation: 3003

cudaMemcpy & blocking

I'm confused by some comments I've seen about blocking and cudaMemcpy. It is my understanding that the Fermi HW can simultaneously execute kernels and do a cudaMemcpy.

I read that the library function cudaMemcpy() is blocking. Does this mean the function blocks further execution until the copy has fully completed? Or does it mean the copy won't start until the previous kernels have finished?

For example, do these two snippets block in the same way?

SomeCudaCall<<<25,34>>>(someData);
cudaThreadSynchronize();

vs

SomeCudaCall<<<25,34>>>(someParam);
cudaMemcpy(toHere, fromHere, sizeof(int), cudaMemcpyHostToDevice);

Upvotes: 8

Views: 7487

Answers (2)

Luc

Reputation: 445

According to the NVIDIA Programming guide:

In order to facilitate concurrent execution between host and device, some function calls are asynchronous: Control is returned to the host thread before the device has completed the requested task. These are:

  • Kernel launches;
  • Memory copies between two addresses to the same device memory;
  • Memory copies from host to device of a memory block of 64 KB or less;
  • Memory copies performed by functions that are suffixed with Async;
  • Memory set function calls.

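So, as long as your transfer size is larger than 64 KB, your examples are equivalent. Note that the copy in your second snippet is a host-to-device transfer of only sizeof(int) bytes, so by the list above it falls into the asynchronous category and the call may return before the device has finished.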
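Here is a minimal sketch of the blocking case, assuming a device-to-host copy (which is not on the asynchronous list above) issued in the default stream right after a kernel launch. The names fillKernel and blockingCopyDemo are made up for illustration. There is no explicit synchronization call; the check at the end only passes if the copy waited for the kernel and completed before returning.

```cuda
#include <cuda_runtime.h>
#include <cstdlib>

// Hypothetical kernel: writes `value` into every element of `out`.
__global__ void fillKernel(int *out, int n, int value) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = value;
}

// Returns 0 if the host sees the kernel's results right after cudaMemcpy,
// nonzero on allocation failure, CUDA error, or unexpected data.
int blockingCopyDemo() {
    const int n = 1 << 20;                       // 4 MB of ints, well above 64 KB
    const size_t bytes = n * sizeof(int);

    int *hostBuf = (int *)malloc(bytes);
    if (!hostBuf) return 1;

    int *devBuf = nullptr;
    if (cudaMalloc((void **)&devBuf, bytes) != cudaSuccess) {
        free(hostBuf);
        return 2;
    }

    // The launch is asynchronous: control returns to the host immediately.
    fillKernel<<<(n + 255) / 256, 256>>>(devBuf, n, 42);

    // No cudaThreadSynchronize()/cudaDeviceSynchronize() here. The copy is
    // queued behind the kernel in the default stream and, being
    // device-to-host, does not return until the data has arrived.
    cudaError_t err = cudaMemcpy(hostBuf, devBuf, bytes, cudaMemcpyDeviceToHost);

    int bad = (err != cudaSuccess);
    for (int i = 0; i < n && !bad; ++i) {
        if (hostBuf[i] != 42) bad = 1;
    }

    cudaFree(devBuf);
    free(hostBuf);
    return bad ? 3 : 0;
}
```

Calling blockingCopyDemo() from main and checking for a zero return is enough to try it out.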

Upvotes: 6

perreal

Reputation: 98118

Your examples are equivalent. If you want asynchronous execution, you can use streams (or separate contexts) together with cudaMemcpyAsync, so that kernel execution can overlap with the copy.
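A rough sketch of the streams approach, under a few assumptions: the host buffer is page-locked (allocated with cudaMallocHost), the kernel and the copy touch different device buffers, and each goes into its own non-default stream. The names scaleKernel and overlapDemo are invented for this example. Whether the two operations actually run at the same time depends on the hardware; the code only makes the overlap possible.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: multiplies every element by `factor`.
__global__ void scaleKernel(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Returns 0 if every CUDA call succeeded, 1 otherwise.
int overlapDemo() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    float *hostIn = nullptr;    // pinned host memory, source of the async copy
    float *devIn = nullptr;     // destination of the async copy
    float *devWork = nullptr;   // buffer the kernel works on
    cudaStream_t computeStream = nullptr;
    cudaStream_t copyStream = nullptr;

    bool good = cudaMallocHost((void **)&hostIn, bytes) == cudaSuccess
             && cudaMalloc((void **)&devIn, bytes) == cudaSuccess
             && cudaMalloc((void **)&devWork, bytes) == cudaSuccess
             && cudaStreamCreate(&computeStream) == cudaSuccess
             && cudaStreamCreate(&copyStream) == cudaSuccess;

    if (good) {
        for (int i = 0; i < n; ++i) hostIn[i] = 1.0f;

        // Kernel work is queued in one stream...
        good = cudaMemsetAsync(devWork, 0, bytes, computeStream) == cudaSuccess;
        scaleKernel<<<(n + 255) / 256, 256, 0, computeStream>>>(devWork, n, 2.0f);
        good = good && cudaGetLastError() == cudaSuccess;

        // ...and the copy in another, so neither has to wait for the other.
        // cudaMemcpyAsync returns to the host without waiting for the transfer.
        good = good && cudaMemcpyAsync(devIn, hostIn, bytes,
                                       cudaMemcpyHostToDevice,
                                       copyStream) == cudaSuccess;

        // Block the host only when the results are actually needed.
        good = good && cudaStreamSynchronize(computeStream) == cudaSuccess
                    && cudaStreamSynchronize(copyStream) == cudaSuccess;
    }

    if (computeStream) cudaStreamDestroy(computeStream);
    if (copyStream) cudaStreamDestroy(copyStream);
    cudaFree(devWork);
    cudaFree(devIn);
    if (hostIn) cudaFreeHost(hostIn);
    return good ? 0 : 1;
}
```

The pinned allocation matters here: with ordinary pageable host memory, cudaMemcpyAsync still works but is generally not able to overlap with kernel execution.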

Upvotes: 8
