sandeep.ganage

Reputation: 1465

Can we use printf or any other similar function in a CUDA Kernel?

My matrix addition example:

__global__ void matrix_add(float *a, float *b, float *c, int N)
{
    int Row = blockIdx.y * blockDim.y + threadIdx.y;
    int Col = blockIdx.x * blockDim.x + threadIdx.x;

    int index = Row * N + Col;      // N is the order of the square matrix

    c[index] = a[index] + b[index];
}

Can I use printf or any other similar function in the above kernel, so that I won't need to transfer data from device to host memory (i.e. cudaMemcpyDeviceToHost)? If yes, then how? If not, why not?

Upvotes: 0

Views: 335

Answers (3)

einpoklum

Reputation: 131930

In addition to CUDA's built-in printf(), you can use the CUDA support branch of this standalone printf library. It has the following benefits:

  • A more complete feature set for the format string: more format specifiers and more supported options.
  • Availability of sprintf(), so you can marshal your printing into a buffer and emit it with a single call, rather than issuing separate printf() calls at different times and risking interleaved output from other threads (see the sketch after these lists).
  • Availability of vprintf() and vsprintf().
  • Support for behavior customization via a closure: a user-provided single-character output function and an opaque void *extra_arg to pass to it.

It also has drawbacks:

  • An extra dependency for your project: you need to download, build, and install the library, or otherwise make it available to your project, and you need to carefully link your compiled device-side code against it.
  • Longer build times than just using the built-in printf().
  • While the library is widely used on all sorts of devices, it is not widely used on GPUs.
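
As a rough illustration of the buffer-marshaling point above, here is a minimal sketch. It assumes the CUDA branch of the library exposes sprintf() and printf() as __device__ functions with the standard C signatures; the header path used here is a guess, and the branch may use suffixed names, so check its documentation for the actual entry points.

#include "printf.h"   // assumed header name; may differ in the CUDA branch

__global__ void report(const float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    char line[64];
    // Build the whole message in a private, per-thread buffer first...
    sprintf(line, "thread %3d: value = %f\n", i, data[i]);
    // ...then emit it with a single call, so output from other threads
    // cannot interleave in the middle of this line.
    printf("%s", line);
}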

Due disclosure: I adapted this library (original by Marco Paland) for use in CUDA and maintain the CPU-side version as well. So, I'm biased in its favor.

Upvotes: 0

Roger Dahl

Reputation: 15734

The only way to display results from your kernels without causing the data to be copied back to the host is to use one of the graphics interoperability modes. CUDA supports interoperability with both OpenGL and Direct3D. Examples of how to use these are in the CUDA Programming Guide.
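
For a rough idea of the pattern, here is a minimal sketch of CUDA/OpenGL buffer sharing. It assumes an OpenGL context and a GL buffer object vbo already exist, that a_d and b_d are device arrays prepared elsewhere, and that matrix_add is the kernel from the question; only the register/map/unmap calls are the point here.

#include <GL/gl.h>
#include <cuda_gl_interop.h>

void add_into_gl_buffer(GLuint vbo, float *a_d, float *b_d, int N)
{
    // Register the GL buffer with CUDA (normally done once, up front).
    cudaGraphicsResource_t resource;
    cudaGraphicsGLRegisterBuffer(&resource, vbo,
                                 cudaGraphicsRegisterFlagsWriteDiscard);

    // Map it and obtain a device pointer the kernel can write to.
    cudaGraphicsMapResources(1, &resource);
    float *c_d = nullptr;
    size_t bytes = 0;
    cudaGraphicsResourceGetMappedPointer((void **)&c_d, &bytes, resource);

    dim3 block(16, 16);
    dim3 grid((N + 15) / 16, (N + 15) / 16);
    matrix_add<<<grid, block>>>(a_d, b_d, c_d, N);

    // Unmap so OpenGL can draw from the buffer; no copy back to the host.
    cudaGraphicsUnmapResources(1, &resource);
    cudaGraphicsUnregisterResource(resource);
}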

__device__ printf() (on compute capability >= 2.0) and __device__ cuPrintf() (on compute capability < 2.0) both cause the printed strings to be implicitly copied back to the host. Very probably, both also cause implicit serialization of all kernels that try to print at the same time, so they are typically used only for debugging.

If you run your CUDA app in the debugger, the device values you view in the debugger have also been implicitly copied to the host.

It is unclear from your question if you want to avoid copying the values back to the host or if you want only to avoid having to explicitly copy the values over. If it's the latter, then the __device__ printf() methods are viable for displaying small amounts of results on the host. Another way to avoid having to explicitly copy the values over is to use a thrust::device_vector. Thrust is a library that comes with CUDA. It's inspired by the C++ STL. You can read and write to the device_vector on the host side, and implicit copies are performed to and from the device in the background.
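
For example, here is a minimal sketch of host-side element access on a thrust::device_vector; each access triggers a small implicit copy under the hood:

#include <thrust/device_vector.h>
#include <cstdio>

int main()
{
    thrust::device_vector<float> v(4, 1.0f);  // storage lives on the device

    v[1] = 42.0f;        // implicit host-to-device copy of one element
    float x = v[1];      // implicit device-to-host copy of one element

    std::printf("v[1] = %f\n", x);
    return 0;
}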

You can also cause implicit copying by using what is called mapped memory. With mapped memory, the CUDA hardware can perform implicit copying of memory between host and device as it is needed by your kernel.
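
A minimal sketch of the mapped-memory approach: the kernel writes directly into host-visible memory, so no explicit cudaMemcpy is needed afterwards.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void fill(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * i;
}

int main()
{
    const int n = 8;
    float *h_ptr, *d_ptr;

    cudaSetDeviceFlags(cudaDeviceMapHost);       // must precede context creation
    cudaHostAlloc((void **)&h_ptr, n * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&d_ptr, h_ptr, 0);  // device-side alias

    fill<<<1, n>>>(d_ptr, n);
    cudaDeviceSynchronize();                     // wait before reading on the host

    for (int i = 0; i < n; ++i)
        std::printf("%f\n", h_ptr[i]);

    cudaFreeHost(h_ptr);
    return 0;
}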

The reason for all this is that copies between host and device are very expensive. Typically, they take up a big chunk of the total compute time. So it is necessary to carefully consider when and how these copies take place. All of the techniques I have mentioned have various performance implications, and how to best handle the copying is app specific.

Upvotes: 1

brano

Reputation: 2870

You can use printf(), but only on devices of compute capability 2.x or higher.
You can read more about this in the CUDA Programming Guide, Appendix B.16.
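
For instance, here is a minimal sketch applied to the kernel from the question (assuming N is small enough that printing every element makes sense):

__global__ void matrix_add(float *a, float *b, float *c, int N)
{
    int Row = blockIdx.y * blockDim.y + threadIdx.y;
    int Col = blockIdx.x * blockDim.x + threadIdx.x;
    int index = Row * N + Col;

    if (Row < N && Col < N) {
        c[index] = a[index] + b[index];
        printf("c[%d][%d] = %f\n", Row, Col, c[index]);
    }
}

The output is buffered on the device and appears on the host's stdout at the next synchronization point (e.g. cudaDeviceSynchronize()); the buffer size can be raised with cudaDeviceSetLimit(cudaLimitPrintfFifoSize, ...).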

Upvotes: 1
