Reputation: 53
I am writing a beginner CUDA program that downsamples an input greyscale image loaded with OpenCV. It works well with an 8-bit greyscale image, but when a 16-bit greyscale image is given as input it produces a noisy downsampled image with the right half of the image blank. Below is the code I've written.
Sample input and output images are provided.
My main.cpp code, where the image is loaded into a cv::Mat:
#include <opencv2/opencv.hpp>
#include <string>

void binFilter(const cv::Mat input, std::string output_file);

int main()
{
    // Load the image unchanged (-1 keeps the original bit depth, e.g. 16-bit)
    cv::Mat im1 = cv::imread("test.png", -1);
    std::string output_file = "resultout.png";
    binFilter(im1, output_file);
    return 0;
}
My CUDA kernel code:
__global__ void binCUDAKernel(unsigned char *input, unsigned char *output, int binDim, int outputWidth, int outputHeight, int inputWstep, int outputWstep, int nChannels)
{
    int outXind = blockIdx.x * blockDim.x + threadIdx.x;
    int outYind = blockIdx.y * blockDim.y + threadIdx.y;
    if ((outXind < outputWidth) && (outYind < outputHeight)) // Only run threads in output image coordinate range
    {
        if (nChannels == 1) // Test only for greyscale images
        {
            // Calculate x & y index of input binned pixels corresponding to current output pixel
            int inXstart = outXind * binDim;
            int inYstart = outYind * binDim;
            // Perform binning on identified input pixels
            float sum = 0;
            for (int binY = inYstart; binY < (inYstart + binDim); binY++) {
                for (int binX = inXstart; binX < (inXstart + binDim); binX++) {
                    int input_tid = binY * inputWstep + binX;
                    sum += input[input_tid];
                }
            }
            // Establish output thread index in current output pixel index
            int output_tid = outYind * outputWstep + outXind;
            // Assign binned pixel value to output pixel
            output[output_tid] = static_cast<unsigned short>(sum / (binDim*binDim));
        }
    }
}
My CPU code:
void binFilter(const cv::Mat input, std::string output_file)
{
    // 2X2 binning
    int binDim = 2;
    // Create blank output image & calculate size of input and output
    cv::Size outsize(input.size().width / binDim, input.size().height / binDim);
    cv::Mat output(outsize, input.type());
    const int inputBytes = input.step * input.rows;
    const int outputBytes = output.step * output.rows;
    // Allocate memory in device
    unsigned char *d_input, *d_output;
    gpuErrchk(cudaMalloc<unsigned char>(&d_input, inputBytes));
    gpuErrchk(cudaMalloc<unsigned char>(&d_output, outputBytes));
    // Copy input image to device
    gpuErrchk(cudaMemcpy(d_input, input.ptr(), inputBytes, cudaMemcpyHostToDevice));
    // Configure size of block and grid
    const dim3 block(16, 16);
    const dim3 grid((output.cols + block.x - 1) / block.x, (output.rows + block.y - 1) / block.y); // Additional block for rounding up
    // Execute kernel
    binCUDAKernel <<<grid, block>>> (d_input, d_output, binDim, output.cols, output.rows, input.step, output.step, input.channels());
    gpuErrchk(cudaPeekAtLastError());
    // Wait for all threads to finish
    //gpuErrchk(cudaDeviceSynchronize());
    // Copy output image from device back to host (cudaMemcpy is a blocking instruction)
    gpuErrchk(cudaMemcpy(output.ptr(), d_output, outputBytes, cudaMemcpyDeviceToHost));
    // Free device memory
    gpuErrchk(cudaFree(d_input));
    gpuErrchk(cudaFree(d_output));
    // Write image to specified output_file path
    cv::imwrite(output_file, output);
}
I suspect it might be a data type mismatch of some sort, but I couldn't figure it out.
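For reference, whether imread actually loaded 8-bit or 16-bit data can be confirmed with a small check like the following (a sketch using im1 from main above; depth() returns CV_8U for 8-bit and CV_16U for 16-bit images):
#include <iostream>

// Print the element layout of the loaded Mat to confirm its bit depth.
// CV_8U (0) means 8-bit pixels, CV_16U (2) means 16-bit pixels.
std::cout << "depth: " << im1.depth()
          << ", channels: " << im1.channels()
          << ", bytes per pixel: " << im1.elemSize() << std::endl;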
Upvotes: 1
Views: 562
Reputation: 16816
First of all, for processing 16-bit images, the pixel data has to be interpreted as a 16-bit wide data type, which can be unsigned short or short. Remember that we only have to interpret the image data as the unsigned short type, not type-cast the pixel values. For this, we just reinterpret the image data pointer as the required type, as in the following example:
unsigned short* ptr16 = reinterpret_cast<unsigned short*>(im1.ptr());
As a consequence of the above step, we have to create a separate kernel for the 16-bit data type as well. We can do this smartly by defining the kernel as a C++ template, so the kernel may look like the following:
template<typename T>
__global__ void binCUDAKernel(T *input, T *output, int binDim, int outputWidth, int outputHeight, int inputWstep, int outputWstep, int nChannels)
{
    int outXind = blockIdx.x * blockDim.x + threadIdx.x;
    int outYind = blockIdx.y * blockDim.y + threadIdx.y;
    if ((outXind < outputWidth) && (outYind < outputHeight)) // Only run threads in output image coordinate range
    {
        if (nChannels == 1) // Test only for greyscale images
        {
            // Calculate x & y index of input binned pixels corresponding to current output pixel
            int inXstart = outXind * binDim;
            int inYstart = outYind * binDim;
            // Perform binning on identified input pixels
            float sum = 0;
            for (int binY = inYstart; binY < (inYstart + binDim); binY++) {
                for (int binX = inXstart; binX < (inXstart + binDim); binX++) {
                    int input_tid = binY * inputWstep + binX;
                    sum += float(input[input_tid]);
                }
            }
            // Establish output thread index in current output pixel index
            int output_tid = outYind * outputWstep + outXind;
            // Assign binned pixel value to output pixel
            output[output_tid] = static_cast<T>(sum / (binDim*binDim));
        }
    }
}
Another important catch when processing an OpenCV Mat with a custom CUDA kernel is that the image step has to be divided by the size of the data type in bytes. For a 16-bit image a single pixel occupies 16 bits (2 bytes), so the step used inside the kernel has to be divided by 2. Remember that the original step itself should not be modified; only the step value passed as the kernel argument should be divided.
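In isolation, and using the same names as the final code below, the adjustment is simply:
// input.step and output.step are byte counts per row; dividing by the element
// size gives the per-row element count that the kernel indexing expects.
int inputStep16 = input.step / sizeof(unsigned short);
int outputStep16 = output.step / sizeof(unsigned short);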
Incorporating the above fixes, the final CPU code may look like this:
void binFilter(const cv::Mat input, std::string output_file)
{
    // 2X2 binning
    int binDim = 2;
    // Create blank output image & calculate size of input and output
    cv::Size outsize(input.size().width / binDim, input.size().height / binDim);
    cv::Mat output(outsize, input.type());
    const int inputBytes = input.step * input.rows;
    const int outputBytes = output.step * output.rows;
    // Allocate memory in device
    unsigned char *d_input, *d_output;
    gpuErrchk(cudaMalloc<unsigned char>(&d_input, inputBytes));
    gpuErrchk(cudaMalloc<unsigned char>(&d_output, outputBytes));
    // Copy input image to device
    gpuErrchk(cudaMemcpy(d_input, input.ptr(), inputBytes, cudaMemcpyHostToDevice));
    // Configure size of block and grid
    const dim3 block(16, 16);
    const dim3 grid((output.cols + block.x - 1) / block.x, (output.rows + block.y - 1) / block.y); // Additional block for rounding up
    int depth = input.depth();
    // Execute kernel
    if (depth == CV_16U)
    {
        typedef unsigned short t16;
        t16* input16 = reinterpret_cast<t16*>(d_input);
        t16* output16 = reinterpret_cast<t16*>(d_output);
        int inputStep16 = input.step / sizeof(t16);
        int outputStep16 = output.step / sizeof(t16);
        binCUDAKernel <t16> <<<grid, block>>> (input16, output16, binDim, output.cols, output.rows, inputStep16, outputStep16, input.channels());
    }
    else
    {
        binCUDAKernel <unsigned char> <<<grid, block>>> (d_input, d_output, binDim, output.cols, output.rows, input.step, output.step, input.channels());
    }
    gpuErrchk(cudaPeekAtLastError());
    // Wait for all threads to finish
    //gpuErrchk(cudaDeviceSynchronize());
    // Copy output image from device back to host (cudaMemcpy is a blocking instruction)
    gpuErrchk(cudaMemcpy(output.ptr(), d_output, outputBytes, cudaMemcpyDeviceToHost));
    // Free device memory
    gpuErrchk(cudaFree(d_input));
    gpuErrchk(cudaFree(d_output));
    // Write image to specified output_file path
    cv::imwrite(output_file, output);
}
The noise in the output image seems to be aliasing introduced by the logic of the binning algorithm; the effect is quite similar to re-sampling the image with a nearest-neighbour approach.
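If a CPU-side reference is helpful for comparison, cv::resize can reproduce both behaviours (a sketch, assuming the input and outsize from binFilter): INTER_NEAREST simply drops pixels and shows similar aliasing, while INTER_AREA averages the source area, which is what 2x2 binning is meant to do.
// Hypothetical CPU reference images for comparison with the CUDA output:
cv::Mat nearestRef, areaRef;
cv::resize(input, nearestRef, outsize, 0, 0, cv::INTER_NEAREST); // aliased, like dropping pixels
cv::resize(input, areaRef, outsize, 0, 0, cv::INTER_AREA);       // averages the source area
cv::imwrite("nearest_ref.png", nearestRef);
cv::imwrite("area_ref.png", areaRef);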
The step-division approach to calculating the memory address of a pixel described above is undocumented and is just a result of intuition, so it may seem a bit unconventional. Another method, used by OpenCV and other libraries, avoids the confusion of image-step division. Given the x and y index of a pixel, it proceeds as follows:
1. Compute the address of the start of row y by adding y * step to the image data pointer (treated as an unsigned char*).
2. Cast the row start address to the required pixel type (e.g. unsigned short*).
3. Add the x index to the typed row pointer to get the pixel address.
Using this method, we can compute the pixel memory address for a grayscale image as follows:
template<typename T>
T* getPixelAddress(unsigned char* data, int x, int y, int step)
{
    T* row = (T*)((unsigned char*)(data) + y * step);
    return row + x;
}
In the above method, the step value is the original one in bytes, without any division.
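As a minimal sketch (not the answer's final code, and using hypothetical names), the templated binning kernel can be written on top of this helper while passing the original byte steps unchanged; note the helper needs a __host__ __device__ qualifier to be callable from device code:
// Assumed adaptation: the same helper with a __host__ __device__ qualifier so
// it can be used inside a kernel; step stays in bytes, no division required.
template<typename T>
__host__ __device__ T* getPixelAddressDev(unsigned char* data, int x, int y, int step)
{
    T* row = (T*)(data + y * step); // advance by whole rows in bytes
    return row + x;                 // then by x elements of type T
}

// Sketch of the binning kernel indexing through byte steps directly.
template<typename T>
__global__ void binCUDAKernelByteStep(unsigned char* input, unsigned char* output, int binDim,
                                      int outputWidth, int outputHeight, int inputStep, int outputStep)
{
    int outXind = blockIdx.x * blockDim.x + threadIdx.x;
    int outYind = blockIdx.y * blockDim.y + threadIdx.y;
    if (outXind >= outputWidth || outYind >= outputHeight)
        return;
    float sum = 0;
    for (int binY = outYind * binDim; binY < outYind * binDim + binDim; binY++)
        for (int binX = outXind * binDim; binX < outXind * binDim + binDim; binX++)
            sum += float(*getPixelAddressDev<T>(input, binX, binY, inputStep));
    // Average the binDim x binDim block and write it to the output pixel
    *getPixelAddressDev<T>(output, outXind, outYind, outputStep) = static_cast<T>(sum / (binDim * binDim));
}
With this variant the launch would pass d_input, d_output, and the unmodified input.step and output.step for both the 8-bit and 16-bit cases; only the template argument changes.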
Upvotes: 2