Howard

Reputation: 99

CUDA kernel only works with a 1D thread index

I have a strange problem with the following code. When I call the first kernel (function), it does not give the correct result. However, when I call the second kernel (function2), it works fine. Does anyone have any idea what the problem is? Thanks!

__global__ void function(int w, class<double> C, float *result) {

    // 2D indexing: x gives the row, y gives the column
    int r = threadIdx.x + blockIdx.x * blockDim.x;
    int c = threadIdx.y + blockIdx.y * blockDim.y;
    int half_w = w / 2;

    if (r < w && c < w) {
        // distance from the image center
        double dis = sqrt((double)(r - half_w) * (r - half_w) + (double)(c - half_w) * (c - half_w));
        result[c * w + r] = (float)C.getVal(dis);
    }
}


__global__ void function2(int w, class<double> C, float *result) {

    // 1D indexing: one flat thread id, split into row and column
    int tid = threadIdx.x + blockIdx.x * blockDim.x;

    int half_w = w / 2;
    int r = tid / w;
    int c = tid % w;

    if (r < w && c < w) {
        // distance from the image center
        double dis = sqrt((double)(r - half_w) * (r - half_w) + (double)(c - half_w) * (c - half_w));
        result[c * w + r] = (float)C.getVal(dis);
    }
}

UPDATE: I use function and function2 to draw an image. Each pixel value depends on the distance between the image center and the pixel's position; given that distance, getVal of class C computes the pixel value. So in the kernel, each thread computes the distance and the corresponding pixel value for one pixel. I check correctness by comparing against a CPU version. function just produces seemingly random values, some very large and some very small. When I changed result[c * w + r] = (float)C.getVal(dis) to result[c * w + r] = 1.0f, the generated image did not seem to change.

The image size is W x W. To launch function, I use:

    dim3 grid_dim(W / 64 + 1, W / 64 + 1);
    dim3 block_dim(64, 64);
    function<<<grid_dim, block_dim>>>(W, C, cu_img);

To launch function2, I use:

    function2<<<W / 128 + 1, 128>>>(W, C, cu_img);

Fixed:

I found the problem. I assigned too many threads to one block: a 64 x 64 block is 4096 threads, but the maximum number of threads per block on my device is 1024. In fact, when I ran cuda-memcheck, I could see that function (the 2D kernel) was never even launched.

Upvotes: 1

Views: 435

Answers (1)

Howard

Reputation: 99

I solved the problem. I assigned too many threads to one block: a 64 x 64 block is 4096 threads, but the maximum number of threads per block on my device is 1024. In fact, when I ran cuda-memcheck, I could see that function (the 2D kernel) was never even launched.
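For anyone hitting the same thing, here is a minimal sketch of a launch that stays under the limit and reports a failed launch instead of silently leaving the output buffer untouched. The names ceil_div, distance_kernel and launch_distance are hypothetical, and the kernel is a stand-in that writes the raw distance, because the class behind C.getVal is not shown in the question. The 16 x 16 block is just one choice that fits within 1024 threads.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Round-up integer division: number of blocks needed to cover n items.
inline unsigned int ceil_div(unsigned int n, unsigned int block) {
    return (n + block - 1) / block;
}

// Stand-in kernel (hypothetical): writes the distance from the image center.
// The original kernel passes this distance to C.getVal(), which is omitted here.
__global__ void distance_kernel(int w, float *result) {
    int r = threadIdx.x + blockIdx.x * blockDim.x;
    int c = threadIdx.y + blockIdx.y * blockDim.y;
    int half_w = w / 2;

    if (r < w && c < w) {
        double dr = r - half_w;
        double dc = c - half_w;
        result[c * w + r] = (float)sqrt(dr * dr + dc * dc);
    }
}

// Launches the kernel on a w x w image and returns the first CUDA error seen.
// d_result must point to w * w floats of device memory.
cudaError_t launch_distance(int w, float *d_result) {
    dim3 block_dim(16, 16);  // 256 threads per block, below the 1024 limit
    dim3 grid_dim(ceil_div(w, block_dim.x), ceil_div(w, block_dim.y));

    distance_kernel<<<grid_dim, block_dim>>>(w, d_result);

    // A bad launch configuration (such as a 64 x 64 block) is reported here.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel launch failed: %s\n", cudaGetErrorString(err));
        return err;
    }

    // Errors that occur while the kernel runs are reported after synchronizing.
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel execution failed: %s\n", cudaGetErrorString(err));
    }
    return err;
}
```

Calling launch_distance(W, cu_img) after allocating cu_img with cudaMalloc would be the typical use. With the original 64 x 64 block, the cudaGetLastError() check should report an invalid configuration, which explains why the image never changed. Rounding up with ceil_div also avoids the extra block that W / 64 + 1 adds when W is an exact multiple of the block size, although that extra block is harmless thanks to the bounds check in the kernel.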

Upvotes: 1
