J.Zhu

Reputation: 11

"cudaLaunch returned (0x9)" and program timing issues

I wrote a CUDA program and have two questions about it.

  1. I know that block_len must be <= 1024 when launching the kernel, but I set block_len > 1024 anyway. When I debugged with cuda-gdb and Nsight, I got the expected "cudaLaunch returned (0x9)" error. If I run the program without a debugger, it runs without complaint and the result matches the CPU (non-parallel) computation, which suggests my calculations are correct. Why does an incorrect program produce the correct result?

  2. The program computes a length * length matrix A, with each element of A computed by one thread; the grid dimension is set to (1,1). When length < 32, the execution time of

    kernel <<< (1,1), (length, length) >>> 
    

    changes in a regular way as length changes. When length > 32, the time spent by the kernel suddenly drops by a factor of 100-1000. I first suspected that my timing code was wrong, but after checking it I could not find a mistake. The timing code is attached below. What causes this result?

    dim3 dimBlock(length, length);
    dim3 dimGrid(1, 1);
    
    float a2;
    cudaEvent_t t1, t2;
    cudaEventCreate(&t1);
    cudaEventCreate(&t2);
    
    cudaEventRecord(t1, 0);
    kernel<<<dimGrid, dimBlock>>>(dev_d, dev_D);
    cudaEventRecord(t2, 0);
    
    cudaEventSynchronize(t1);
    cudaEventSynchronize(t2);
    cudaEventElapsedTime(&a2,t1,t2);
    printf("kernel time: %f (ms)\n",a2);
    

If length=32, the kernel time is:

    kernel time: 37.341919 (ms)

If length=33, the kernel time is:

    kernel time: 0.004128 (ms)

Some information about my device:

Information of my devices screenshot

Upvotes: 1

Views: 539

Answers (1)

Robert Crovella

Reputation: 152143

You should provide complete code. However:

  1. It may be that the correct results are still in memory from the previous run. When you set length > 32, then you have an illegal kernel launch, and your kernel will not run or produce any results. You can confirm this by clearing the output data before the kernel launch. For example, if dev_D contains the output of the kernel, then do something like this:

    cudaMemset(dev_D, 0, length*length*sizeof(dev_D[0]));
    kernel<<<dimGrid, dimBlock>>>(dev_d, dev_D);
    

    If you do that, and the kernel fails to run, you should definitely get 0 in dev_D instead of the expected result.

  2. When the length parameter is greater than 32, you are requesting more than 1024 threads per block (33 * 33 = 1089), which is illegal in CUDA. So the kernel does not run (use proper CUDA error checking if you want to confirm this). When the kernel does not run, the measured launch time is much shorter than when the kernel does run.
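Both points can be checked with something like the following. This is only a minimal sketch, not your actual program: the `fillOnes` kernel, the `CUDA_CHECK` macro, the `blockThreadCount` helper and the `dev_out` buffer are hypothetical stand-ins I made up for illustration, and device 0 is assumed. It clears the output buffer, attempts a 33 x 33 launch, asks the runtime whether the launch was accepted, and then checks whether the output was ever written.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper macro: abort with a readable message if a runtime call fails.
#define CUDA_CHECK(call)                                           \
    do {                                                           \
        cudaError_t err_ = (call);                                 \
        if (err_ != cudaSuccess) {                                 \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,     \
                    cudaGetErrorString(err_));                     \
            exit(EXIT_FAILURE);                                    \
        }                                                          \
    } while (0)

// Number of threads in a length x length block (32 -> 1024, 33 -> 1089).
static unsigned blockThreadCount(unsigned length) { return length * length; }

// Placeholder kernel (not the asker's): each thread writes 1.0f to its element.
__global__ void fillOnes(float *out, int length)
{
    int idx = threadIdx.y * length + threadIdx.x;
    out[idx] = 1.0f;
}

int main()
{
    const int length = 33;  // one more than the largest legal square block edge

    cudaDeviceProp prop;
    CUDA_CHECK(cudaGetDeviceProperties(&prop, 0));  // assumes device 0
    printf("threads requested: %u, device limit: %d\n",
           blockThreadCount(length), prop.maxThreadsPerBlock);

    const size_t count = (size_t)length * length;
    float *dev_out = nullptr;
    CUDA_CHECK(cudaMalloc(&dev_out, count * sizeof(float)));
    // Clear the output so stale results from an earlier run cannot be mistaken
    // for fresh ones.
    CUDA_CHECK(cudaMemset(dev_out, 0, count * sizeof(float)));

    dim3 dimBlock(length, length);
    dim3 dimGrid(1, 1);
    fillOnes<<<dimGrid, dimBlock>>>(dev_out, length);

    // A kernel launch returns nothing, so launch errors must be queried explicitly.
    cudaError_t launchErr = cudaGetLastError();
    if (launchErr != cudaSuccess) {
        printf("launch failed: %s\n", cudaGetErrorString(launchErr));
    } else {
        CUDA_CHECK(cudaDeviceSynchronize());  // surfaces errors raised during execution
        printf("kernel ran\n");
    }

    // Copy the buffer back and count elements the kernel actually wrote.
    std::vector<float> host(count);
    CUDA_CHECK(cudaMemcpy(host.data(), dev_out, count * sizeof(float),
                          cudaMemcpyDeviceToHost));
    size_t written = 0;
    for (float v : host) {
        if (v != 0.0f) ++written;
    }
    printf("elements written by the kernel: %zu of %zu\n", written, count);

    CUDA_CHECK(cudaFree(dev_out));
    return 0;
}
```

Error code 9 in the CUDA runtime is `cudaErrorInvalidConfiguration`, which matches the "(0x9)" the debugger reported. On a device with a 1024 threads-per-block limit I would expect this to report a failed launch and zero elements written; changing `length` to 32 should let the kernel run.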

Upvotes: 1
