Reputation: 99
There is a weird problem. I have following code. When I call first function it does not give correct result. However, when I call the function2 (the second function) it works fine. It is so weird to me. Does anyone has any idea about the problem? Thanks!!!
__global__ void function(int w, class<double> C, float *result) {
int r = threadIdx.x + blockIdx.x * blockDim.x;
int c = threadIdx.y + blockIdx.y * blockDim.y;
int half_w = w /2;
if (r < w && c < w) {
double dis = sort((double)(r - half_w) * (r - half_w) + (double)(c_half_w) * (c - half_w));
result[c * w + r] = (float)C.getVal(dis);
}
}
__global__ void function2(int w, class<double> C, float *result) {
int tid = threadIdx.x + blockIdx.x * blockDim.x;
int half_w = w /2;
int r = tid / w;
int c = tid % w;
if (r < w && c < w) {
double dis = sort((double)(r - half_w) * (r - half_w) + (double)(c_half_w) * (c - half_w));
result[c * w + r] = (float)C.getVal(dis);
}
}
UPDATE:
I use the function
and function2
to draw an image. The pixel value is based on the distance between image center and current pixel position. Based on the distance, the class C getVal will calculate the value for the pixel. So, in the kernel, I just make every thread to calculate the distance and corresponding pixel value. The correct result is compared with CPU version. The function
is just give some random value some very larger some very small. When I changed the result[c * w + r] = (float)C.getVal(dis)
to result[c * w +r ] = 1.0f
, the generated image seems does not change.
The image size is W x W, to launch function
I set
dim3 grid_dim(w / 64 + 1, w / 64 + 1);
dim3 block_dim(64, 64);
function<<<grid_dim, block_dim>>>(W, C, cu_img);
To launch function2
function2<<<W / 128 + 1, 128>>>(W, C, cu_img)
Fixed:
I got the problem. I assigned too many threads to one block. The max threads in one block is 1024 in my device. Actually, when I run cuds-memcheck, I can see the function2
does not even launched.
Upvotes: 1
Views: 435
Reputation: 99
I solved the problem. I assigned too many threads to one block. The max threads in one block is 1024 in my device. Actually, when I ran cuda-memcheck, I can see the function2
was not ever launched.
Upvotes: 1