Reputation: 1465
My matrix addition example:
__global__ void matrix_add(float *a, float *b, float *c, int N)
{
    int Row = blockIdx.y * blockDim.y + threadIdx.y;
    int Col = blockIdx.x * blockDim.x + threadIdx.x;
    int index = Row * N + Col; // N is the order of the square matrix
    if (Row < N && Col < N)    // guard threads that fall outside the matrix
        c[index] = a[index] + b[index];
}
Can I use printf() or any other similar function in the above kernel, so that I won't need to transfer data from device to host memory (i.e. via cudaMemcpyDeviceToHost)? If yes, then how? If not, why not?
Upvotes: 0
Views: 335
Reputation: 131930
In addition to CUDA's built-in printf(), you can use the CUDA support branch of this standalone printf library. This has the following benefits:

- It provides sprintf(), so you can marshal your printing into a buffer rather than having to issue separate printf()'s at different times, increasing the risk of interruptions by printed output from other threads (a sketch of this pattern follows below).
- It supports vprintf() and vsprintf().
- It lets you supply your own output function, with a void *extra_arg to pass to it.

but it also has detriments compared to CUDA's built-in printf().
Due disclosure: I adapted this library (original by Marco Paland) for use in CUDA and maintain the CPU-side version as well. So, I'm biased in its favor.
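For illustration, here is a minimal sketch of the marshal-into-a-buffer pattern. The header path and the snprintf_() entry point are assumptions based on the upstream (CPU-side) library; the CUDA branch may name them differently:

#include "printf/printf.h" // assumed header path for the library's CUDA branch

__global__ void matrix_add_verbose(float *a, float *b, float *c, int N)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= N || col >= N) return;
    int index = row * N + col;
    c[index] = a[index] + b[index];

    // Marshal everything this thread wants to report into one buffer,
    // then emit it with a single printf(), so that output from other
    // threads cannot be interleaved into the middle of the line.
    char buf[64];
    snprintf_(buf, sizeof(buf), "(%d,%d): %f + %f = %f",
              row, col, a[index], b[index], c[index]);
    printf("%s\n", buf);
}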
Upvotes: 0
Reputation: 15734
The only way to display results from your kernels without causing the data to be copied back to the host is to use one of the graphics interoperability modes; CUDA supports interoperability with both OpenGL and Direct3D. Examples of how to use these are in the CUDA Programming Guide.
Both __device__ printf() (on compute capability >= 2.0) and __device__ cuPrintf() (on compute capability < 2.0) cause implicit copying of the printed strings back to the host. Very probably, both also cause implicit serialization of all kernels that try to print at the same time, so they are typically used only for debugging.
If you run your CUDA app in a debugger, the device values you inspect there have also been implicitly copied to the host.
It is unclear from your question whether you want to avoid copying the values back to the host altogether, or only to avoid having to copy them explicitly. If it's the latter, then the __device__ printf() methods are viable for displaying small amounts of results on the host. Another way to avoid explicit copies is to use a thrust::device_vector. Thrust is a library that ships with CUDA, inspired by the C++ STL. You can read and write a device_vector from the host side, and the implicit copies to and from the device are performed in the background.
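For instance, here is a minimal, self-contained sketch (using a flat vector addition rather than the asker's 2D kernel) of how a device_vector hides the copies:

#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/functional.h>
#include <cstdio>

int main()
{
    const int N = 16;
    // Constructing and filling from host code implicitly copies host -> device.
    thrust::device_vector<float> a(N, 1.0f);
    thrust::device_vector<float> b(N, 2.0f);
    thrust::device_vector<float> c(N);

    // Element-wise addition, executed on the device.
    thrust::transform(a.begin(), a.end(), b.begin(), c.begin(),
                      thrust::plus<float>());

    // Reading an element from host code triggers the device -> host copy
    // behind the scenes; no explicit cudaMemcpy() is needed.
    printf("c[0] = %f\n", (float)c[0]);
    return 0;
}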
You can also cause implicit copying by using what is called mapped (or zero-copy) memory. With mapped memory, the CUDA hardware can copy data between host and device implicitly, as it is needed by your kernel.
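A minimal sketch of the mapped-memory approach (error checking omitted for brevity):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void matrix_add(float *a, float *b, float *c, int N)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N)
        c[row * N + col] = a[row * N + col] + b[row * N + col];
}

int main()
{
    const int N = 16;
    size_t bytes = N * N * sizeof(float);
    cudaSetDeviceFlags(cudaDeviceMapHost); // must run before the context is created

    float *a, *b, *c;    // host pointers to mapped allocations
    float *da, *db, *dc; // device-side views of the same memory
    cudaHostAlloc((void**)&a, bytes, cudaHostAllocMapped);
    cudaHostAlloc((void**)&b, bytes, cudaHostAllocMapped);
    cudaHostAlloc((void**)&c, bytes, cudaHostAllocMapped);
    cudaHostGetDevicePointer((void**)&da, a, 0);
    cudaHostGetDevicePointer((void**)&db, b, 0);
    cudaHostGetDevicePointer((void**)&dc, c, 0);

    for (int i = 0; i < N * N; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    dim3 block(16, 16), grid((N + 15) / 16, (N + 15) / 16);
    matrix_add<<<grid, block>>>(da, db, dc, N);
    cudaDeviceSynchronize(); // after this, the results are visible in c[] on the host

    printf("c[0] = %f\n", c[0]);
    cudaFreeHost(a); cudaFreeHost(b); cudaFreeHost(c);
    return 0;
}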
The reason for all this is that copies between host and device are very expensive; they typically take up a big chunk of the total compute time, so it is necessary to consider carefully when and how these copies take place. All of the techniques mentioned here have different performance implications, and the best way to handle the copying is application-specific.
Upvotes: 1
Reputation: 2870
You can use printf(...) in kernel code, but only on devices of compute capability 2.x or higher. You can read more about this in the CUDA C Programming Guide, Appendix B.16.
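For example, a minimal version of the asker's kernel that prints its own results (compile for compute capability 2.0 or higher, e.g. with -arch=sm_20):

#include <cstdio>

__global__ void matrix_add(float *a, float *b, float *c, int N)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= N || col >= N) return;
    int index = row * N + col;
    c[index] = a[index] + b[index];
    // Output is buffered on the device and flushed to the host at
    // synchronization points such as cudaDeviceSynchronize().
    printf("c[%d][%d] = %f\n", row, col, c[index]);
}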
Upvotes: 1