Reputation: 354
I'm writing an app in C that converts a .png image to grayscale. I'm using C and CUDA. I have a problem with the CUDA code and I don't know why. (I'm a CUDA beginner.)
My transformation function looks like this:
__global__
void setPixelToGrayscale(unsigned char *image)
{
    int i = threadIdx.x*4;
    float gray;
    float r, g, b;
    r = image[i + 0];
    g = image[i + 1];
    b = image[i + 2];
    gray = .299f*r + .587f*g + .114f*b;
    image[i + 0] = gray;
    image[i + 1] = gray;
    image[i + 2] = gray;
    image[i + 3] = 255;
}
void transformToGrayCuda(rgb_image *img)
{
    unsigned char* image = img->image;
    unsigned char* image_d;
    unsigned width = img->width;
    unsigned height = img->height;
    int N = (int)width * (int)height;
    size_t size = N * sizeof(unsigned char);
    cudaMalloc((void **) image_d, size);
    cudaMemcpy(image_d, image, size, cudaMemcpyHostToDevice);
    setPixelToGrayscale<<<1, N>>>(image_d);
    cudaMemcpy(image, image_d, size, cudaMemcpyDeviceToHost);
    cudaFree(image_d);
    /* this works fine if cuda code is commented
    int j=0;
    for(j=0; j<N; j++)
    {
        int i = j*4;
        float gray;
        float r, g, b;
        r = image[i + 0];
        g = image[i + 1];
        b = image[i + 2];
        gray = .299f*r + .587f*g + .114f*b;
        image[i + 0] = gray;
        image[i + 1] = gray;
        image[i + 2] = gray;
        image[i + 3] = 255;
    }
    */
}
I must have done something wrong in the CUDA version, because when the CUDA code is commented out and I run the C loop instead, everything works fine. Why doesn't my CUDA code work properly?
EDIT: This is my test image: https://i.sstatic.net/B3yJu.png
This is my result with CUDA: https://i.sstatic.net/bzmWJ.png
This is my result with only the C code: http://i.imgur.com/lU4vIiK.png
That is what I meant when I wrote that my CUDA code does not work properly.
Upvotes: 0
Views: 1331
Reputation: 630
The problem in your code is:
cudaMalloc((void **) image_d, size);
You should pass a pointer to the pointer (the address of image_d), not cast the pointer variable itself to void **. The right code is:
cudaMalloc(&image_d, size);
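To illustrate the difference, here is a minimal sketch of the allocation with its return code checked. The helper name `allocDeviceImage` is hypothetical, introduced only for this example, and the byte count is whatever the caller passes in.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not from the question): allocates `size` bytes on the
// device and returns the device pointer, or NULL if the allocation failed.
unsigned char *allocDeviceImage(size_t size)
{
    unsigned char *image_d = NULL;

    // cudaMalloc writes the device address into image_d, so it needs the
    // address of the pointer variable, not the uninitialized pointer value.
    cudaError_t err = cudaMalloc((void **)&image_d, size);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return NULL;
    }
    return image_d;
}
```

Checking the returned `cudaError_t` is what would normally surface this kind of mistake early, since CUDA runtime calls report failures through their return value rather than by aborting.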
Upvotes: 2
Reputation: 816
What is the value of N? You are running all N threads in a single block, but there is a limit of 512 or 1024 threads per block, depending on the GPU. If N > 512, launch several blocks instead: number of blocks = 1 + N/512, threads per block = 512. In the kernel you then need to compute the global thread index (blockIdx.x * blockDim.x + threadIdx.x) and check that it is less than N to avoid accessing out-of-bounds memory.
Also, kernel launches are asynchronous, so call cudaDeviceSynchronize() after the kernel invocation if you need to wait for it explicitly. (The blocking cudaMemcpy that follows the launch in your code already waits for the kernel to finish.)
If you post the exact error or issue you are getting, I can provide more help.
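The launch configuration described above could look roughly like the sketch below. It assumes the image is stored as 4 bytes per pixel (RGBA), as the `i = j*4` indexing in the question suggests, so the buffer holds 4 * N bytes. The names `grayscaleKernel`, `toGrayscaleCuda`, and `THREADS_PER_BLOCK` are made up for this example, and 256 threads per block is an arbitrary choice within the device limit.

```cuda
#include <cuda_runtime.h>

#define THREADS_PER_BLOCK 256  // assumed value; must not exceed the device limit

// One thread per pixel; each pixel is assumed to occupy 4 bytes (RGBA).
__global__ void grayscaleKernel(unsigned char *image, int numPixels)
{
    // Global pixel index across all blocks, not just threadIdx.x.
    int pixel = blockIdx.x * blockDim.x + threadIdx.x;
    if (pixel >= numPixels) return;  // the last block may have spare threads

    int i = pixel * 4;
    float gray = .299f * image[i + 0] + .587f * image[i + 1] + .114f * image[i + 2];
    image[i + 0] = (unsigned char)gray;
    image[i + 1] = (unsigned char)gray;
    image[i + 2] = (unsigned char)gray;
    image[i + 3] = 255;
}

// Hypothetical host wrapper: converts an RGBA buffer of numPixels pixels in place.
cudaError_t toGrayscaleCuda(unsigned char *image, int numPixels)
{
    if (numPixels <= 0) return cudaErrorInvalidValue;

    size_t size = (size_t)numPixels * 4;  // 4 bytes per pixel
    unsigned char *image_d = NULL;

    cudaError_t err = cudaMalloc((void **)&image_d, size);
    if (err != cudaSuccess) return err;

    err = cudaMemcpy(image_d, image, size, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        // Round up so every pixel gets a thread.
        int blocks = (numPixels + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
        grayscaleKernel<<<blocks, THREADS_PER_BLOCK>>>(image_d, numPixels);
        err = cudaGetLastError();  // reports launch errors such as a bad configuration
    }
    if (err == cudaSuccess) {
        // Blocking copy on the default stream: waits for the kernel to finish.
        err = cudaMemcpy(image, image_d, size, cudaMemcpyDeviceToHost);
    }
    cudaFree(image_d);
    return err;
}
```

Rounding the block count up, rather than using 1 + N/512, avoids launching an empty extra block when N is an exact multiple of the block size. The bounds check in the kernel is needed either way.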
Upvotes: 2