Reputation: 411
I have a problem accessing each element of a matrix in a CUDA kernel. I'm working with OpenCV and I'm trying to "do something" on each element of the matrix.
So, I'm converting a uint8_t matrix to a float matrix like this:
for(int i=0; i<inputMatrix.rows; ++i){
    for(int j=0; j<inputMatrix.cols * cn; j+=cn){
        examMatrix[i*inputMatrix.cols*cn + j + 0] = pixelPtr[i*inputMatrix.cols*cn + j + 0]; // B
        examMatrix[i*inputMatrix.cols*cn + j + 1] = pixelPtr[i*inputMatrix.cols*cn + j + 1]; // G
        examMatrix[i*inputMatrix.cols*cn + j + 2] = pixelPtr[i*inputMatrix.cols*cn + j + 2]; // R
    }
}
This works for a 3-channel image: if I create an output image from this matrix (after converting back to uint8_t), it looks the same as the input.
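For reference, the conversion above copies every byte one-to-one, so it can be written as a single flat loop. The following is only a sketch: the function name toFloat and its parameters are hypothetical, and it assumes the pixel buffer is tightly packed (rows * cols * channels bytes, no row padding), which the loop above also assumes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Convert an interleaved 8-bit image (e.g. BGR) to float, element by element.
// Assumption: the source buffer has no padding between rows.
std::vector<float> toFloat(const std::uint8_t* pixels, int rows, int cols, int channels) {
    assert(pixels != nullptr && rows >= 0 && cols >= 0 && channels > 0);
    const std::size_t n = static_cast<std::size_t>(rows) * cols * channels;
    std::vector<float> out(n);
    for (std::size_t k = 0; k < n; ++k) {
        out[k] = static_cast<float>(pixels[k]);  // same flat offset on both sides
    }
    return out;
}
```

Channel c of the pixel at (row, col) then sits at offset channels * (row * cols + col) + c in the returned vector.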
But I want to make some changes using CUDA:
I set block size and grid size like this:
dim3 dimBlock(count, 3);
dim3 dimGrid( frameHeight/count, frameWidth/count);
Where count is the number of threads, 3 is the number of channels, and frameHeight and frameWidth are the frame size.
So, I allocated GPUexamMatrix and GPUresultMatrix and tried to access each element in the kernel. My kernel looks like this:
resultMatrix[(blockIdx.x * blockIdx.y) + (threadIdx.x * threadIdx.y)] = examMatrix[(blockIdx.x * blockIdx.y) + (threadIdx.x * threadIdx.y)];
So, as you can see, I simply tried to copy the matrix. After this operation, when I copied the matrix back to the host and printed it, I got really small or really big float numbers inside the matrix, not the numbers from examMatrix.
I suppose I'm doing something wrong inside the kernel. Any ideas?
Upvotes: 0
Views: 153
Reputation: 151799
Your usage of the built-in variables can't possibly be correct. To give just one simple example, threadIdx.x = 0 and threadIdx.y = 2 will access the same element as threadIdx.x = 2 and threadIdx.y = 0. You don't have unique indexing. I could try to give you something that will work, but I'm a little confused by your dimBlock variable. In short, this is not how I would set up grid/block/indexing to handle a 2D array. I wouldn't use 3, the channel number, in my threadblock dimensioning.
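To make the collision concrete, here is a small host-side sketch (plain C++, not a kernel) that evaluates the question's index expression for every thread of a launch and counts how many distinct offsets come out. The function name distinctQuestionIndices is hypothetical, and the loop variables merely stand in for blockIdx and threadIdx.

```cpp
#include <cassert>
#include <set>

// Count the distinct values of (blockIdx.x * blockIdx.y) + (threadIdx.x * threadIdx.y)
// over a whole launch. gridX/gridY and blockX/blockY stand in for gridDim and blockDim.
int distinctQuestionIndices(int gridX, int gridY, int blockX, int blockY) {
    assert(gridX > 0 && gridY > 0 && blockX > 0 && blockY > 0);
    std::set<int> seen;
    for (int by = 0; by < gridY; ++by)
        for (int bx = 0; bx < gridX; ++bx)
            for (int ty = 0; ty < blockY; ++ty)
                for (int tx = 0; tx < blockX; ++tx)
                    seen.insert(bx * by + tx * ty);  // the question's index expression
    return static_cast<int>(seen.size());
}
```

For a 2x2 grid of 4x3 blocks, 48 threads are launched but only 8 distinct offsets are produced, so most threads write to the same few elements and the rest of the output buffer is never written. That would be consistent with the garbage values seen after copying back to the host.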
Try something like this:
// keep count small, e.g. 16: count*count must not exceed the per-block thread limit (512 or 1024, depending on the GPU)
dim3 dimBlock(count, count);
dim3 dimGrid( frameWidth/dimBlock.x, frameHeight/dimBlock.y);
and in your kernel:
int row = threadIdx.y + blockIdx.y*blockDim.y;
int col = threadIdx.x + blockIdx.x*blockDim.x;
resultMatrix[3*(row*frameWidth + col) + 0] = examMatrix[3*(row*frameWidth + col) + 0];
resultMatrix[3*(row*frameWidth + col) + 1] = examMatrix[3*(row*frameWidth + col) + 1];
resultMatrix[3*(row*frameWidth + col) + 2] = examMatrix[3*(row*frameWidth + col) + 2];
The above assumes frameWidth and frameHeight are evenly divisible by count.
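If the frame size is not a multiple of count, a common approach is to round the grid size up and have each thread check that its row and col are inside the frame before touching memory. The sketch below emulates that on the CPU so the index arithmetic can be checked without a GPU. The names blocksNeeded and emulateGuardedCopy are hypothetical, and the four nested loops stand in for blockIdx.y, blockIdx.x, threadIdx.y and threadIdx.x.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Round-up division: blocks of size `block` needed to cover `extent` elements.
int blocksNeeded(int extent, int block) {
    assert(extent >= 0 && block > 0);
    return (extent + block - 1) / block;
}

// CPU stand-in for the copy kernel, with a bounds guard for partial blocks.
std::vector<float> emulateGuardedCopy(const std::vector<float>& examMatrix,
                                      int frameWidth, int frameHeight, int count) {
    assert(frameWidth > 0 && frameHeight > 0 && count > 0);
    assert(examMatrix.size() == static_cast<std::size_t>(3) * frameWidth * frameHeight);
    std::vector<float> resultMatrix(examMatrix.size(), -1.0f);
    const int gridX = blocksNeeded(frameWidth, count);   // would be dimGrid.x
    const int gridY = blocksNeeded(frameHeight, count);  // would be dimGrid.y
    for (int by = 0; by < gridY; ++by)
        for (int bx = 0; bx < gridX; ++bx)
            for (int ty = 0; ty < count; ++ty)
                for (int tx = 0; tx < count; ++tx) {
                    const int row = ty + by * count;  // threadIdx.y + blockIdx.y*blockDim.y
                    const int col = tx + bx * count;  // threadIdx.x + blockIdx.x*blockDim.x
                    if (row >= frameHeight || col >= frameWidth) continue;  // outside the frame
                    const std::size_t base =
                        3 * (static_cast<std::size_t>(row) * frameWidth + col);
                    for (int c = 0; c < 3; ++c)
                        resultMatrix[base + c] = examMatrix[base + c];
                }
    return resultMatrix;
}
```

In a real kernel the guard would be the same comparison on row and col, with frameWidth and frameHeight passed in as kernel arguments.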
Upvotes: 1