Reputation: 139
I used x & y for calculating the cells of a matrix on the device.
When I use more than 32 for lenA & lenB, the breakpoint (at int x= threadIdx.x; in the device code) is never hit and the output isn't correct.
In host code:
int lenA=52;
int lenB=52;
dim3 threadsPerBlock(lenA, lenB);
dim3 numBlocks(lenA / threadsPerBlock.x, lenB / threadsPerBlock.y);
kernel_matrix<<<numBlocks,threadsPerBlock>>>(dev_A, dev_B);
In device code:
int x= threadIdx.x;
int y= threadIdx.y;
...
Upvotes: 0
Views: 151
Reputation: 152249
Your threadsPerBlock dim3 variable must satisfy the requirements for the compute capability that you are targeting:
CC 1.x devices can handle up to 512 threads per block.
CC 2.0 - 8.6 devices can handle up to 1024 threads per block.
Your dim3 variable at (32,32) specifies 1024 (= 32 x 32) threads per block, which is exactly the limit. When you exceed that, as with (52,52) = 2704 threads per block, the kernel launch fails.
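One possible way to stay under the limit is to fix the block size (16 x 16 here, an arbitrary choice) and cover the matrix with several blocks. The sketch below is only an illustration: the kernel in the question is elided, so fill_matrix, its parameters, and the x + y cell value are all hypothetical stand-ins. Note that the index now has to include blockIdx, and the grid is rounded up so that sizes like 52, which are not a multiple of 16, are still fully covered.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel (not the one from the question):
// writes x + y into each cell of a lenA x lenB matrix.
__global__ void fill_matrix(int *out, int lenA, int lenB)
{
    // Global coordinates: block offset plus thread offset within the block.
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    // The grid is rounded up, so some threads fall outside the matrix.
    if (x < lenA && y < lenB)
        out[y * lenA + x] = x + y;
}

int main()
{
    const int lenA = 52;
    const int lenB = 52;

    int *dev_out = nullptr;
    cudaMalloc(&dev_out, lenA * lenB * sizeof(int));

    // 16 x 16 = 256 threads per block, well under the 1024 limit.
    dim3 threadsPerBlock(16, 16);
    // Round up so the whole matrix is covered (4 x 4 blocks for 52 x 52).
    dim3 numBlocks((lenA + threadsPerBlock.x - 1) / threadsPerBlock.x,
                   (lenB + threadsPerBlock.y - 1) / threadsPerBlock.y);

    fill_matrix<<<numBlocks, threadsPerBlock>>>(dev_out, lenA, lenB);
    cudaDeviceSynchronize();

    // Copy the result back and print one cell as a quick sanity check.
    std::vector<int> host_out(lenA * lenB);
    cudaMemcpy(host_out.data(), dev_out, lenA * lenB * sizeof(int),
               cudaMemcpyDeviceToHost);
    printf("last cell = %d\n", host_out[lenA * lenB - 1]);

    cudaFree(dev_out);
    return 0;
}
```

Error checking on the CUDA calls is left out here to keep the sketch short; real code should check each call.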
If you did cuda error checking on your kernel launch, you would see the error.
Since the kernel doesn't actually launch with this type of error, any breakpoints set in the kernel code also won't be hit.
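A minimal sketch of that kind of error checking, assuming a trivial placeholder kernel (empty_kernel is hypothetical, not the kernel from the question). It deliberately launches with the (52,52) block from the question so that the failure is reported instead of passing silently.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical placeholder kernel; the body does not matter for this check.
__global__ void empty_kernel() {}

int main()
{
    // 52 x 52 = 2704 threads per block, over the 1024 limit.
    dim3 threadsPerBlock(52, 52);
    dim3 numBlocks(1, 1);

    empty_kernel<<<numBlocks, threadsPerBlock>>>();

    // Launch-configuration errors are reported by cudaGetLastError()
    // immediately after the launch.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel launch failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Errors that occur while the kernel runs surface at the next
    // synchronizing call.
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel execution failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("kernel ran successfully\n");
    return 0;
}
```

With a block this large the first check should fire, which matches the symptom in the question: the kernel never runs, so a breakpoint inside it is never reached.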
Additional notes:
You won't get any compilation error for threads per block, regardless of what you do. It doesn't work that way. The compiler doesn't check that.
If you do proper CUDA error checking you will get a runtime error report, and even if you don't do proper CUDA error checking, your kernel will not actually run with that sort of error.
Upvotes: 2