Reputation: 53
I am writing a beginner CUDA program that downsamples an input greyscale image loaded with OpenCV. It works well with an 8-bit greyscale image, but when a 16-bit greyscale image is given as input it produces a noisy downsampled image with the right half of the image blank. Below is the code I've written.
Sample input and output images are provided.
My main.cpp code, where the image is loaded into a cv::Mat:
#include <opencv2/opencv.hpp>
#include <string>

void binFilter(const cv::Mat input, std::string output_file);

int main()
{
    // Load the image unchanged (-1 keeps the original bit depth, e.g. 16-bit)
    cv::Mat im1 = cv::imread("test.png", -1);
    std::string output_file = "resultout.png";
    binFilter(im1, output_file);
    return 0;
}
My CUDA kernel code:
__global__ void binCUDAKernel(unsigned char *input, unsigned char *output, int binDim, int outputWidth, int outputHeight, int inputWstep, int outputWstep, int nChannels)
{
    int outXind = blockIdx.x * blockDim.x + threadIdx.x;
    int outYind = blockIdx.y * blockDim.y + threadIdx.y;
    if ((outXind < outputWidth) && (outYind < outputHeight)) // Only run threads in output image coordinate range
    {
        if (nChannels == 1) // Test only for greyscale images
        {
            // Calculate x & y index of input binned pixels corresponding to current output pixel
            int inXstart = outXind * binDim;
            int inYstart = outYind * binDim;
            // Perform binning on identified input pixels
            float sum = 0;
            for (int binY = inYstart; binY < (inYstart + binDim); binY++) {
                for (int binX = inXstart; binX < (inXstart + binDim); binX++) {
                    int input_tid = binY * inputWstep + binX;
                    sum += input[input_tid];
                }
            }
            // Establish output thread index in current output pixel index
            int output_tid = outYind * outputWstep + outXind;
            // Assign binned pixel value to output pixel
            output[output_tid] = static_cast<unsigned short>(sum / (binDim*binDim));
        }
    }
}
My CPU code:
void binFilter(const cv::Mat input, std::string output_file)
{
    // 2X2 binning
    int binDim = 2;
    // Create blank output image & calculate size of input and output
    cv::Size outsize(input.size().width / binDim, input.size().height / binDim);
    cv::Mat output(outsize, input.type());
    const int inputBytes = input.step * input.rows;
    const int outputBytes = output.step * output.rows;
    // Allocate memory in device
    unsigned char *d_input, *d_output;
    gpuErrchk(cudaMalloc<unsigned char>(&d_input, inputBytes));
    gpuErrchk(cudaMalloc<unsigned char>(&d_output, outputBytes));
    // Copy input image to device
    gpuErrchk(cudaMemcpy(d_input, input.ptr(), inputBytes, cudaMemcpyHostToDevice));
    // Configure size of block and grid
    const dim3 block(16, 16);
    const dim3 grid((output.cols + block.x - 1) / block.x, (output.rows + block.y - 1) / block.y); // Additional block for rounding up
    // Execute kernel
    binCUDAKernel <<<grid, block>>> (d_input, d_output, binDim, output.cols, output.rows, input.step, output.step, input.channels());
    gpuErrchk(cudaPeekAtLastError());
    // Wait for all threads to finish
    //gpuErrchk(cudaDeviceSynchronize());
    // Copy output image from device back to host (cudaMemcpy is a blocking instruction)
    gpuErrchk(cudaMemcpy(output.ptr(), d_output, outputBytes, cudaMemcpyDeviceToHost));
    // Free device memory
    gpuErrchk(cudaFree(d_input));
    gpuErrchk(cudaFree(d_output));
    // Write image to specified output_file path
    cv::imwrite(output_file, output);
}
I suspect it might be a data type mismatch of some sort, but I couldn't figure it out.
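For reference, whether imread actually loaded 8-bit or 16-bit data can be confirmed with a small check like the following (a sketch using im1 from main above; depth() returns CV_8U for 8-bit and CV_16U for 16-bit images):
#include <iostream>

// Print the element layout of the loaded Mat to confirm its bit depth.
// CV_8U (0) means 8-bit pixels, CV_16U (2) means 16-bit pixels.
std::cout << "depth: " << im1.depth()
          << ", channels: " << im1.channels()
          << ", bytes per pixel: " << im1.elemSize() << std::endl;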
Upvotes: 1
Views: 562
Reputation: 16816
First of all, for processing 16-bit images, the pixel data has to be interpreted as a 16-bit wide data type, which can be unsigned short or short. Remember that we only have to interpret the image data as the unsigned short type, not type-cast the pixel values. For this, we just reinterpret the image data pointer as the required type, as in the following example:
unsigned short* ptr16 = reinterpret_cast<unsigned short*>(im1.ptr());
As a consequence of the above step, we have to create a separate kernel for the 16-bit data type as well. We can do this smartly by defining the kernel as a C++ template, so the kernel may look like the following:
template<typename T>
__global__ void binCUDAKernel(T *input, T *output, int binDim, int outputWidth, int outputHeight, int inputWstep, int outputWstep, int nChannels)
{
    int outXind = blockIdx.x * blockDim.x + threadIdx.x;
    int outYind = blockIdx.y * blockDim.y + threadIdx.y;
    if ((outXind < outputWidth) && (outYind < outputHeight)) // Only run threads in output image coordinate range
    {
        if (nChannels == 1) // Test only for greyscale images
        {
            // Calculate x & y index of input binned pixels corresponding to current output pixel
            int inXstart = outXind * binDim;
            int inYstart = outYind * binDim;
            // Perform binning on identified input pixels
            float sum = 0;
            for (int binY = inYstart; binY < (inYstart + binDim); binY++) {
                for (int binX = inXstart; binX < (inXstart + binDim); binX++) {
                    int input_tid = binY * inputWstep + binX;
                    sum += float(input[input_tid]);
                }
            }
            // Establish output thread index in current output pixel index
            int output_tid = outYind * outputWstep + outXind;
            // Assign binned pixel value to output pixel
            output[output_tid] = static_cast<T>(sum / (binDim*binDim));
        }
    }
}
Another important catch when processing an OpenCV Mat with a custom CUDA kernel is that the image step has to be divided by the size of the data type in bytes. For a 16-bit image a single pixel occupies 16 bits (2 bytes), so the step used inside the kernel has to be divided by 2. Remember that the original step itself should not be modified; only the step value passed as the kernel argument should be divided.
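In isolation, and using the same names as the final code below, the adjustment is simply:
// input.step and output.step are byte counts per row; dividing by the element
// size gives the per-row element count that the kernel indexing expects.
int inputStep16 = input.step / sizeof(unsigned short);
int outputStep16 = output.step / sizeof(unsigned short);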
Incorporating the above fixes, the final CPU code may look like this:
void binFilter(const cv::Mat input, std::string output_file)
{
    // 2X2 binning
    int binDim = 2;
    // Create blank output image & calculate size of input and output
    cv::Size outsize(input.size().width / binDim, input.size().height / binDim);
    cv::Mat output(outsize, input.type());
    const int inputBytes = input.step * input.rows;
    const int outputBytes = output.step * output.rows;
    // Allocate memory in device
    unsigned char *d_input, *d_output;
    gpuErrchk(cudaMalloc<unsigned char>(&d_input, inputBytes));
    gpuErrchk(cudaMalloc<unsigned char>(&d_output, outputBytes));
    // Copy input image to device
    gpuErrchk(cudaMemcpy(d_input, input.ptr(), inputBytes, cudaMemcpyHostToDevice));
    // Configure size of block and grid
    const dim3 block(16, 16);
    const dim3 grid((output.cols + block.x - 1) / block.x, (output.rows + block.y - 1) / block.y); // Additional block for rounding up
    int depth = input.depth();
    // Execute kernel
    if (depth == CV_16U)
    {
        typedef unsigned short t16;
        t16* input16 = reinterpret_cast<t16*>(d_input);
        t16* output16 = reinterpret_cast<t16*>(d_output);
        int inputStep16 = input.step / sizeof(t16);
        int outputStep16 = output.step / sizeof(t16);
        binCUDAKernel <t16> <<<grid, block>>> (input16, output16, binDim, output.cols, output.rows, inputStep16, outputStep16, input.channels());
    }
    else
    {
        binCUDAKernel <unsigned char> <<<grid, block>>> (d_input, d_output, binDim, output.cols, output.rows, input.step, output.step, input.channels());
    }
    gpuErrchk(cudaPeekAtLastError());
    // Wait for all threads to finish
    //gpuErrchk(cudaDeviceSynchronize());
    // Copy output image from device back to host (cudaMemcpy is a blocking instruction)
    gpuErrchk(cudaMemcpy(output.ptr(), d_output, outputBytes, cudaMemcpyDeviceToHost));
    // Free device memory
    gpuErrchk(cudaFree(d_input));
    gpuErrchk(cudaFree(d_output));
    // Write image to specified output_file path
    cv::imwrite(output_file, output);
}
The noise in the output image seems to be aliasing introduced by the logic of the binning algorithm; the effect is quite similar to re-sampling the image with a nearest-neighbour approach.
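If a CPU-side reference is helpful for comparison, cv::resize can reproduce both behaviours (a sketch, assuming the input and outsize from binFilter): INTER_NEAREST simply drops pixels and shows similar aliasing, while INTER_AREA averages the source area, which is what 2x2 binning is meant to do.
// Hypothetical CPU reference images for comparison with the CUDA output:
cv::Mat nearestRef, areaRef;
cv::resize(input, nearestRef, outsize, 0, 0, cv::INTER_NEAREST); // aliased, like dropping pixels
cv::resize(input, areaRef, outsize, 0, 0, cv::INTER_AREA);       // averages the source area
cv::imwrite("nearest_ref.png", nearestRef);
cv::imwrite("area_ref.png", areaRef);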
The step-division approach to calculating the memory address of a pixel described above is undocumented and is just a result of intuition, so it may seem a bit unconventional. Another method, used by OpenCV and other libraries, avoids the confusion of image-step division. Given the x and y index of a pixel, it proceeds as follows:
1. Compute the address of the start of row y by adding y * step to the image data pointer (treated as an unsigned char*).
2. Cast the row start address to the required pixel type (e.g. unsigned short*).
3. Add the x index to the typed row pointer to get the pixel address.
Using this method, we can compute the pixel memory address for a grayscale image as follows:
template<typename T>
T* getPixelAddress(unsigned char* data, int x, int y, int step)
{
    T* row = (T*)((unsigned char*)(data) + y * step);
    return row + x;
}
In the above method, the step value is the original one in bytes, without any division.
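As a minimal sketch (not the answer's final code, and using hypothetical names), the templated binning kernel can be written on top of this helper while passing the original byte steps unchanged; note the helper needs a __host__ __device__ qualifier to be callable from device code:
// Assumed adaptation: the same helper with a __host__ __device__ qualifier so
// it can be used inside a kernel; step stays in bytes, no division required.
template<typename T>
__host__ __device__ T* getPixelAddressDev(unsigned char* data, int x, int y, int step)
{
    T* row = (T*)(data + y * step); // advance by whole rows in bytes
    return row + x;                 // then by x elements of type T
}

// Sketch of the binning kernel indexing through byte steps directly.
template<typename T>
__global__ void binCUDAKernelByteStep(unsigned char* input, unsigned char* output, int binDim,
                                      int outputWidth, int outputHeight, int inputStep, int outputStep)
{
    int outXind = blockIdx.x * blockDim.x + threadIdx.x;
    int outYind = blockIdx.y * blockDim.y + threadIdx.y;
    if (outXind >= outputWidth || outYind >= outputHeight)
        return;
    float sum = 0;
    for (int binY = outYind * binDim; binY < outYind * binDim + binDim; binY++)
        for (int binX = outXind * binDim; binX < outXind * binDim + binDim; binX++)
            sum += float(*getPixelAddressDev<T>(input, binX, binY, inputStep));
    // Average the binDim x binDim block and write it to the output pixel
    *getPixelAddressDev<T>(output, outXind, outYind, outputStep) = static_cast<T>(sum / (binDim * binDim));
}
With this variant the launch would pass d_input, d_output, and the unmodified input.step and output.step for both the 8-bit and 16-bit cases; only the template argument changes.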
Upvotes: 2