Reputation: 1265
I'm new to CUDA. I'm writing image-processing code in CUDA. My C code and my attempted CUDA conversion are below; the CUDA version does not work correctly.
My C code:
void imageProcess_usingPoints(int point, unsigned short *img)
{
    // doing image processing here using the point variable value
}

int main(int argc, char **argv)
{
    /* here I define and initialize some variables */
    int point = 0;
    unsigned short *image_data;
    // assume the image is read here and all pixel values are stored in image_data
    for (int i = 0; i < 1050; i++, point += 1580)
    {
        // calling an image processing function, e.g. blurring the image
        imageProcess_usingPoints(point, image_data);
        /* doing some image processing using that point value on a 16-bit grayscale image */
    }
    return 0;
}
I tried to convert my C code to CUDA, but it is wrong. The CUDA code I tried is below.
__global__ void processOnImage(int pointInc)
{
    int line = blockIdx.x * blockDim.x + threadIdx.x;
    int point = line * pointInc;
    /* here I am not getting the same value of point as in the C code */
    /* doing image processing here using the point value */
}
int main(int argc, char **argv)
{
    /* here I define and initialize some variables */
    int pointInc = 1580;
    static const int BLOCK_WIDTH = 25;
    int x = static_cast<int>(ceilf(static_cast<float>(1050) / BLOCK_WIDTH));
    const dim3 grid(x, 1);
    const dim3 block(BLOCK_WIDTH, 1);
    processOnImage<<<grid, block>>>(pointInc);
    cudaDeviceSynchronize();   // wait for the kernel to finish before exiting
    return 0;
}
In the processOnImage kernel of the CUDA code I am not getting the same value of the point variable as in the C code above. What am I doing wrong in the CUDA code, and how should I use blocks and threads for this?
Upvotes: 0
Views: 315
Reputation: 9781
Basically, you could set the number of threads per block to a multiple of warpSize (or just a multiple of 32):
http://docs.nvidia.com/cuda/cuda-c-programming-guide/#warpsize
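For reference, warpSize can be queried at runtime through the device properties. A minimal sketch (device 0 is assumed; the value is 32 on all current NVIDIA GPUs):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);        // properties of device 0
    printf("warpSize = %d\n", prop.warpSize);
    return 0;
}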
Usually 256 is a good choice for most simple kernels; the exact number has to be tuned. This tool in the CUDA installation directory can also help you choose the number:
$CUDA_HOME/tools/CUDA_Occupancy_Calculator.xls
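If you are on CUDA 6.5 or later, the runtime can also suggest a block size programmatically via cudaOccupancyMaxPotentialBlockSize. A minimal sketch (the empty kernel is just a stand-in for yours):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void processOnImage(int pointInc)
{
    /* image processing here */
}

int main()
{
    int minGridSize = 0, blockSize = 0;
    // returns the block size that maximizes theoretical occupancy for this kernel
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, processOnImage, 0, 0);
    printf("suggested block size: %d\n", blockSize);
    return 0;
}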
After determining the number of threads per block, you can then calculate the number of blocks required by your data size. The following article shows how to do that:
https://developer.nvidia.com/content/easy-introduction-cuda-c-and-c
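A minimal sketch of that pattern, using the numbers from your question (1050 iterations with a point stride of 1580; the names are just illustrative):

__global__ void processOnImage(int numLines, int pointInc)
{
    int line = blockIdx.x * blockDim.x + threadIdx.x;
    if (line < numLines)                 // guard the extra threads in the last block
    {
        int point = line * pointInc;     // same value as point in the C loop
        /* image processing here using point */
    }
}

int main()
{
    const int numLines = 1050;
    const int pointInc = 1580;
    const int threadsPerBlock = 256;     // multiple of warpSize
    const int blocks = (numLines + threadsPerBlock - 1) / threadsPerBlock;   // round up
    processOnImage<<<blocks, threadsPerBlock>>>(numLines, pointInc);
    cudaDeviceSynchronize();             // wait for the kernel to finish
    return 0;
}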
On the other hand, you can also use a fixed number of blocks for an arbitrary data size; sometimes you get higher performance this way. See this for more details:
https://developer.nvidia.com/content/cuda-pro-tip-write-flexible-kernels-grid-stride-loops
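A grid-stride version of the same kernel would look like this (a sketch under the same assumptions as above; launch it with any fixed grid, e.g. <<<32, 256>>>):

__global__ void processOnImage(int numLines, int pointInc)
{
    // each thread handles several lines, striding by the total number of threads
    for (int line = blockIdx.x * blockDim.x + threadIdx.x;
         line < numLines;
         line += blockDim.x * gridDim.x)
    {
        int point = line * pointInc;
        /* image processing here using point */
    }
}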
Upvotes: 1