mahmood

Reputation: 24795

printf() in my CUDA kernel doesn't produce any output

I have added some printf() statements to my CUDA program:

__global__ void Kernel(float *, float *, int);
void DeviceFunc(float *temp_h , int numvar , float *temp1_h)
{ .....
    //Kernel call
    printf("calling kernel\n");
    Kernel<<<dimGrid , dimBlock>>>(a_d , b_d , numvar);
    printf("kernel called\n");
  ....
}

int main(int argc , char **argv)
{   ....
    printf("beforeDeviceFunc\n\n");
    DeviceFunc(a_h , numvar , b_h); //Showing the data
    printf("after DeviceFunc\n\n");
    ....
}

Also, in Kernel.cu I wrote:

#include <cuda.h>
#include <stdio.h>
__global__ void Kernel(float *a_d , float *b_d , int size)
{
    int idx = threadIdx.x ;
    int idy = threadIdx.y ;
    //Allocating storage in the shared memory of the device
    __shared__ float temp[16][16];

    //Copying the data to the shared memory
    temp[idy][idx] = a_d[(idy * (size+1)) + idx] ;
    printf("idx=%d, idy=%d, size=%d\n", idx, idy, size);
    ....
}

Then I compile using -arch=sm_20 like this:

nvcc -c -arch sm_20 main.cu
nvcc -c -arch sm_20 Kernel.cu
nvcc -arch sm_20 main.o Kernel.o -o main

Now when I run the program, I see:

beforeDeviceFunc

calling kernel
kernel called
after DeviceFunc

So the printf() inside the kernel is not printed. How can I fix that?

Upvotes: 7

Views: 15943

Answers (2)

tera

Reputation: 7265

printf() output is only displayed if the kernel finishes successfully, so check the return codes of all CUDA function calls and make sure no errors are reported.
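
One common way to do that is to wrap every runtime call in a small checking helper. A minimal sketch (the CHECK macro name here is my own, not part of the CUDA API):

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

/* Hypothetical helper: print a readable message and abort when a CUDA call fails. */
#define CHECK(call)                                            \
    do {                                                       \
        cudaError_t err = (call);                              \
        if (err != cudaSuccess) {                              \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__, \
                    cudaGetErrorString(err));                  \
            exit(EXIT_FAILURE);                                \
        }                                                      \
    } while (0)

/* Usage, e.g.: CHECK(cudaMemcpy(a_d, a_h, bytes, cudaMemcpyHostToDevice)); */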

Furthermore printf() output is only displayed at certain points in the program. Appendix B.32.2 of the Programming Guide lists these as

  • Kernel launch via <<<>>> or cuLaunchKernel() (at the start of the launch, and if the CUDA_LAUNCH_BLOCKING environment variable is set to 1, at the end of the launch as well),
  • Synchronization via cudaDeviceSynchronize(), cuCtxSynchronize(), cudaStreamSynchronize(), cuStreamSynchronize(), cudaEventSynchronize(), or cuEventSynchronize(),
  • Memory copies via any blocking version of cudaMemcpy*() or cuMemcpy*(),
  • Module loading/unloading via cuModuleLoad() or cuModuleUnload(),
  • Context destruction via cudaDeviceReset() or cuCtxDestroy(),
  • Prior to executing a stream callback added by cudaStreamAddCallback() or cuStreamAddCallback().
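
For example, setting the CUDA_LAUNCH_BLOCKING variable mentioned in the first bullet makes each launch flush its printf() output when it completes (shell usage, assuming the executable from the question):

CUDA_LAUNCH_BLOCKING=1 ./main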

To check this is your problem, put the following code after your kernel invocation:

{
    cudaError_t cudaerr = cudaDeviceSynchronize();
    if (cudaerr != cudaSuccess)
        printf("kernel launch failed with error \"%s\".\n",
               cudaGetErrorString(cudaerr));
}

You should then see either the output of your kernel or an error message.

More conveniently, cuda-memcheck will automatically check all return codes for you if you run your executable under it. While you should always check for errors anyway, this comes in handy when resolving concrete issues.
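
For example, with the main executable built in the question:

cuda-memcheck ./main

No code changes are required; errors are reported as the program runs.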

Upvotes: 16

Darth Jurassic

Reputation: 668

I had the same error just now, and decreasing the block size to 512 helped. According to the documentation, the maximum block size is either 512 or 1024, depending on the GPU.

I have written a simple test which showed that my GTX 1070 has a maximum block size of 1024. UPD: you can check whether your kernel has ever executed by calling cudaPeekAtLastError(), which returns cudaSuccess if the kernel has launched successfully; only then is it worth calling cudaDeviceSynchronize().
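
A minimal sketch of that launch-check pattern (reusing the set1 kernel and t buffer from the full listing below):

set1<<<1, blockSize>>>(t);
cudaError_t err = cudaPeekAtLastError();  // catches launch/configuration errors immediately
if (err == cudaSuccess)
    err = cudaDeviceSynchronize();        // waits for the kernel and catches execution errors
if (err != cudaSuccess)
    fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));

On my GTX 1070 the test printed: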

Testing block size of 1023
Testing block size of 1024
Testing block size of 1025
CUDA error: invalid configuration argument
Block maximum size is 1024

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <cstdio>

__global__
void set1(int* t)
{
    t[threadIdx.x] = 1;
}

inline bool failed(cudaError_t error)
{
    if (cudaSuccess == error)
        return false;

    fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(error));
    return true;
}

int main()
{
    int blockSize;
    for (blockSize = 1; blockSize < 1 << 12; blockSize++)
    {
        printf("Testing block size of %d\n", blockSize);
        int* t = nullptr;  // so cudaFree(t) is safe even if the allocation fails
        if(failed(cudaMallocManaged(&t, blockSize * sizeof(int))))
        {
            failed(cudaFree(t));
            break;
        }
        for (int i = 0; i < blockSize; i++)
            t[i] = 0;  // zero the whole buffer so the kernel's writes can be verified
        set1 <<<1, blockSize>>> (t);
        if (failed(cudaPeekAtLastError()))
        {
            failed(cudaFree(t));
            break;
        }
        if (failed(cudaDeviceSynchronize()))
        {
            failed(cudaFree(t));
            break;
        }

        bool hasError = false;
        for (int i = 0; i < blockSize; i++)
            if (1 != t[i])
            {
                printf("CUDA error: t[%d] = %d but not 1\n", i, t[i]);
                hasError = true;
                break;
            }
        if (hasError)
        {
            failed(cudaFree(t));
            break;
        }

        failed(cudaFree(t));
    }
    blockSize--;
    if(blockSize <= 0)
    {
        printf("CUDA error: block size cannot be 0\n");
        return 1;
    }
    printf("Block maximum size is %d\n", blockSize);
    return 0;
}

P.S. Please note that the only thing that really matters in block sizing is warp granularity, which is 32 nowadays: if 0 == yourBlockSize % 32, the warps are used pretty efficiently. The only reason to make blocks bigger than 32 is when the code needs synchronization, as synchronization is only available among threads in a single block, which forces a developer to use a single large block instead of many small ones. So running a higher number of smaller blocks can be even more efficient than running a lower number of larger blocks.
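
For instance, covering n elements with many 256-thread blocks instead of one big block might look like this (a hypothetical set1All kernel, since the set1 kernel from the test above only indexes within a single block):

__global__ void set1All(int* t, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global index across all blocks
    if (i < n)
        t[i] = 1;
}

// Host side: the same work as one big block, split into many small ones
void launchSet1All(int* t, int n)
{
    int threadsPerBlock = 256;                                 // a multiple of the 32-thread warp size
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // round up so all n elements are covered
    set1All<<<blocks, threadsPerBlock>>>(t, n);
}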

Upvotes: 0
