Reputation: 24179
I'm looking for a way to extract a part of an existing CUDA Toolkit example and turn it into a CUDAKernel executable in MATLAB.
In an attempt to obtain a short-runtime implementation of the non-local means (NLM) 2D filter, I stumbled upon the imageDenoising example provided with the CUDA Toolkit which implements two variants of this filter, called NLM & NLM2 (or "quick NLM").
Having no previous experience with CUDA coding, I initially attempted to follow MATLAB's documentation on the subject, which resulted in several strange errors including: ptx compilation, multiple entry points and wrong number of inputs in the C prototype. At this point I realized that this isn't going to be a "just works" case and that some tinkering is required.
So I decided to eliminate the multiple entry point issue by simply deleting parts of imageDenoising.cu
file and consolidating the relevant .cuh
(either ..._nlm_kernel.cuh
or ..._nlm2_kernel.cuh
) into the .cu
so as to obtain a single entry point at any given time.
To my surprise this actually managed to compile and I was finally able to create a CUDAKernel
without an error (using the command k = parallel.gpu.CUDAKernel('imageDenoising.ptx', 'uint8_T *, int, int, float, float');
).
This however was not enough, because I mistakenly concluded that the 1st argument is the unprocessed image in the form of an RGB matrix (i.e. X*Y*3 uint8
), and so the result I was getting back was exactly the input but with 0
in the 1st 4 elements.
After searching a bit more I realized that there are additional, and critical, aspects I'm entirely unaware of (like the need to initialize __device__
variables) to such a conversion process, at which stage I decided to ask for help.
I'm currently wondering how to efficiently continue from here. While I'd love to hear if this kind of approach can generally bear fruit (and whether a complete example of this process is available somewhere), which other pitfalls I should look out for, and what alternative courses of action I can take (considering my very limited knowledge in CUDA and the fact I won't hire anybody else to do this for me), I keep in mind that this is SO and so I must have a specific programming problem, so here goes:
How do I modify
imageDenoising.cu
such that the MATLABCUDAKernel
constructed from it will also accept the unprocessed image as an input?
Note: in my application, the input matrix is a 2d, grayscale, double
matrix.
Related: How CudaMalloc work?
P.S.
A working piece of code would obviously be welcomed, but I'd really rather "learn to fish".
Upvotes: 0
Views: 682
Reputation: 24179
I ended up taking an alternative approach to CUDAKernel, using .MEX
, by doing the following:
#include "mexopencv.hpp"
using namespace cv;
void mexFunction(int nlhs, mxArray *plhs[],
int nrhs, const mxArray *prhs[])
{
// Check arguments
if (nlhs != 1 || nrhs<1 || ((nrhs % 2) != 1) )
mexErrMsgIdAndTxt("fastNLM:invalidArgs", "Wrong number of arguments");
// Argument vector
vector<MxArray> rhs(prhs, prhs + nrhs);
// Option processing
// Defaults:
double h = 3;
int templateWindowSize = 7;
int searchWindowSize = 21;
// Parsing input name-value pairs:
for (int i = 1; i<nrhs; i += 2) {
string key = rhs[i].toString();
if (key == "h")
h = rhs[i + 1].toDouble();
else if (key == "templateWindowSize")
templateWindowSize = rhs[i + 1].toInt();
else if (key == "searchWindowSize")
searchWindowSize = rhs[i + 1].toInt();
else
mexErrMsgIdAndTxt("mexopencv:error", "Unrecognized option");
}
// Process
Mat src(rhs[0].toMat()), dst;
fastNlMeansDenoising(src, dst, h, templateWindowSize, searchWindowSize);
// Convert cv::Mat back to mxArray*
plhs[0] = MxArray(dst);
}
The answer to my question itself can be found by comparing opencv\sources\modules\photo\src\cuda\nlm.cu
(this is the opencv2 path) with imageDenoising_nlm2_kernel.cuh
.
This solution worked well for me because it was more important for me to get an NLM filter running, rather than using CUDAKernel.
The main lesson I learned from this (and I'd like to pass on to others) is:
Running CUDA code in MATLAB can also be done in ways other than CUDAKernel, such as using
.mex
wrappers as shown above.
Upvotes: 1