In CUDA programming, how can I tell when a GPU kernel has completed its task?

Question

In CUDA programming suppose I am calling a kernel function from the host.

Suppose the kernel function is,

__global__ void my_kernel_func() {
    // doing some tasks utilizing multiple threads
}

Now from the host I am calling it using,

my_kernel_func<<<...>>>();

In the NVIDIA examples, three more functions are called afterwards:

cudaGetLastError()

CUDA Doc : Returns the last error that has been produced by any of the runtime calls in the same host thread and resets it to cudaSuccess.

cudaMemcpy()

CUDA Doc : Copies count bytes from the memory area pointed to by src to the memory area pointed to by dst, where kind specifies the direction of the copy, and must be one of cudaMemcpyHostToHost, cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, cudaMemcpyDeviceToDevice, or cudaMemcpyDefault. Passing cudaMemcpyDefault is recommended, in which case the type of transfer is inferred from the pointer values. However, cudaMemcpyDefault is only allowed on systems that support unified virtual addressing. Calling cudaMemcpy() with dst and src pointers that do not match the direction of the copy results in an undefined behavior.

and then

cudaDeviceSynchronize()

CUDA Doc : Blocks until the device has completed all preceding requested tasks. cudaDeviceSynchronize() returns an error if one of the preceding tasks has failed. If the cudaDeviceScheduleBlockingSync flag was set for this device, the host thread will block until the device has finished its work.

Now I have tried to put a print statement at the end of the kernel function,

__global__ void my_kernel_func() {
    // doing some tasks utilizing multiple threads
    printf("D\n");
}

I also printed at different points in the sequential host code:

cudaGetLastError();
printf("A\n");
cudaMemcpy(...);
printf("B\n");
cudaDeviceSynchronize();
printf("C\n");

This prints in the following order:

A
D
B
C

Basically, I need the time at which the kernel completes its task, and I am unsure where to take the end timestamp. Copying the data back should take a considerable amount of time, so if I take the end timestamp after the copy, it may include the copy time as well.

Is there another function that detects when the kernel has finished?

kaby76 · Accepted Answer

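As pointed out in the documentation, cudaMemcpy() exhibits synchronous behavior: here it blocks the host until the preceding kernel has finished and the copy is complete. The cudaDeviceSynchronize() that follows is therefore effectively a no-op, because the synchronization already happened at the memcpy.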

cudaGetLastError() checks whether the kernel launch itself was okay. It does not wait for the kernel to finish.
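As a rough sketch of the difference between a launch check and a completion check, something like the following could be used. The empty kernel body, the `<<<1, 32>>>` launch configuration, and the helper name `launch_and_check` are assumptions made for illustration; they are not from the question.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical empty kernel standing in for my_kernel_func.
__global__ void my_kernel_func() {}

// Returns true if both the launch and the execution succeeded.
bool launch_and_check() {
    my_kernel_func<<<1, 32>>>();  // launch configuration chosen arbitrarily

    // Reports launch-time problems, e.g. an invalid grid/block configuration.
    // The kernel may still be running when this returns.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        std::fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));
        return false;
    }

    // Blocks the host until the kernel has finished and reports errors
    // that occurred while it was executing.
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        std::fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        return false;
    }
    return true;
}

int main() {
    return launch_and_check() ? 0 : 1;
}
```

Only the return of the second call tells the host that the kernel has actually completed.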

If you want to time the kernel and not the memcpy, swap the order of the cudaMemcpy() and cudaDeviceSynchronize() calls: start the timer just before the kernel call, then read the timer right after cudaDeviceSynchronize() returns. Make sure to check the return value of cudaDeviceSynchronize() as well.
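A minimal sketch of that ordering, assuming a host-side std::chrono timer, might look like this. The kernel body (doubling each element), the array size, the block size of 256, and the helper name `time_kernel_ms` are all hypothetical and only stand in for the real workload.

```cuda
#include <chrono>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel standing in for my_kernel_func: doubles each element.
__global__ void my_kernel_func(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Times the launch plus execution of the kernel only.
// Returns elapsed milliseconds, or -1.0 if the kernel failed.
double time_kernel_ms(float *d_data, int n) {
    const int block = 256;                      // assumed block size
    const int grid = (n + block - 1) / block;   // enough blocks to cover n

    auto start = std::chrono::steady_clock::now();
    my_kernel_func<<<grid, block>>>(d_data, n);
    cudaError_t err = cudaDeviceSynchronize();  // wait for the kernel to end
    auto stop = std::chrono::steady_clock::now();

    if (err != cudaSuccess) return -1.0;
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    std::vector<float> h(n, 1.0f);

    float *d = nullptr;
    if (cudaMalloc(&d, bytes) != cudaSuccess) return 1;
    cudaMemcpy(d, h.data(), bytes, cudaMemcpyHostToDevice);

    double ms = time_kernel_ms(d, n);

    // The copy back happens after the timer has been read,
    // so it is not part of the measurement.
    cudaMemcpy(h.data(), d, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d);

    if (ms < 0.0) return 1;
    std::printf("kernel time: %.3f ms\n", ms);
    return 0;
}
```

The first launch of a kernel can include one-time setup cost, so a warm-up launch before the timed one is often used when the numbers matter.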
