"cudaLaunch returned (0x9)" and program timing issues

Question

I wrote a CUDA program and have two questions about it.

1. When I call the kernel function, I know that block_len must be <= 1024, but I still set block_len > 1024. When I debugged with cuda-gdb and Nsight, I got the expected "cudaLaunch returned (0x9)" error. If I run the program without a debugger, it runs smoothly and the result matches the one computed on the CPU (without parallelism), which suggests that my calculations are correct. Why does an incorrect program produce the correct result?
2. The program computes a length × length matrix A. Each element of A is computed by one thread, and gridDim is set to (1,1). When length <= 32, the execution time of
```
kernel<<<(1,1), (length, length)>>>
```
changes regularly with the size of length. When length > 32, the time spent in the kernel suddenly drops by a factor of 100 to 1000. I first suspected that my timing code was wrong, but after checking it I could not find a mistake. The timing code is attached below. What causes this result?
```
dim3 dimBlock(length, length);
dim3 dimGrid(1, 1);

float a2;
cudaEvent_t t1, t2;
cudaEventCreate(&t1);
cudaEventCreate(&t2);

cudaEventRecord(t1, 0);
kernel<<<dimGrid, dimBlock>>>(dev_d, dev_D);
cudaEventRecord(t2, 0);

cudaEventSynchronize(t1);
cudaEventSynchronize(t2);
cudaEventElapsedTime(&a2,t1,t2);
printf("kernel time: %f (ms)\n", a2);
```

If length = 32, the kernel time is:

    kernel time: 37.341919 (ms)

If length = 33, the kernel time is:

    kernel time: 0.004128 (ms)
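The timing code above never queries the launch status, so a rejected launch and a very fast kernel would look the same in these numbers. The following is a minimal sketch of how the status could be checked for both sizes. The kernel body, `tryLaunch`, and `threadsPerBlock` are hypothetical placeholders and not the actual program; only the `(length, length)` block shape and the `(1, 1)` grid are taken from the question.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real kernel: one thread per matrix element.
__global__ void kernel(float *A, int length)
{
    int row = threadIdx.y;
    int col = threadIdx.x;
    A[row * length + col] = 1.0f;
}

// Total number of threads requested by a block shape.
static unsigned int threadsPerBlock(dim3 block)
{
    return block.x * block.y * block.z;
}

// Launches with a (length, length) block and returns the resulting status.
static cudaError_t tryLaunch(int length)
{
    float *dev_A = nullptr;
    cudaError_t err = cudaMalloc(&dev_A, sizeof(float) * length * length);
    if (err != cudaSuccess) return err;

    dim3 dimBlock(length, length);
    dim3 dimGrid(1, 1);
    kernel<<<dimGrid, dimBlock>>>(dev_A, length);

    err = cudaGetLastError();           // reports a rejected launch configuration
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();  // reports errors raised while the kernel runs

    cudaFree(dev_A);
    return err;
}

int main()
{
    const int lengths[] = {32, 33};
    for (int length : lengths) {
        cudaError_t err = tryLaunch(length);
        printf("length=%d threads=%u status=%d (%s)\n", length,
               threadsPerBlock(dim3(length, length)), (int)err,
               cudaGetErrorString(err));
    }
    return 0;
}
```

A 32 × 32 block requests 1024 threads and a 33 × 33 block requests 1089. On a device limited to 1024 threads per block, the second launch would be expected to report status 9 (`cudaErrorInvalidConfiguration`), the same value as the 0x9 shown by the debugger. The exact behavior depends on the device, so this is a diagnostic sketch and not a confirmed result.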

Some information about my device:

[Screenshot: device information]

Answers (1)