Reputation: 125
I am processing 2208x1242 frames from a video in a while-loop, using C++ with OpenCV.
To speed things up, I wanted to execute the operations on the GPU of my Nvidia Jetson Nano.
For the color conversion from BGR to HSV, using cv::cuda::cvtColor instead of cv::cvtColor gives me a speedup by a factor of 5.
Unfortunately, morphological operations are much slower on the GPU:
int num_frame = 10;
int frame = 0;
cv::Mat img;
cv::cuda::GpuMat img_gpu;
cv::Mat open_kernel = cv::getStructuringElement(cv::MORPH_RECT, cv::Size(11, 11));
cv::Mat close_kernel = cv::getStructuringElement(cv::MORPH_RECT, cv::Size(21, 21));
while (frame < num_frame) {
    // load image to img
    // ...
    img_gpu.upload(img);
    cv::Ptr<cv::cuda::Filter> morph_filter_open = cv::cuda::createMorphologyFilter(cv::MORPH_OPEN, img_gpu.type(), open_kernel);
    cv::Ptr<cv::cuda::Filter> morph_filter_close = cv::cuda::createMorphologyFilter(cv::MORPH_CLOSE, img_gpu.type(), close_kernel);
    morph_filter_open->apply(img_gpu, img_gpu);
    morph_filter_close->apply(img_gpu, img_gpu);
    frame++;
}
Measuring only the apply() calls, the GPU version is about 20x slower than cv::morphologyEx on the CPU of the Jetson Nano (0.07s on the CPU vs. 1.5s on the GPU for a single frame).
nvprof shows that most of the time is spent in cudaDeviceSynchronize (this is for the whole program, which does more than the code sample above, but the long-running operations are probably related to the morphology):
      Type  Time(%)      Time  Calls       Avg       Min       Max  Name
API calls:   71.05%  17.2756s    665  25.978ms  25.730us  1.44814s  cudaDeviceSynchronize
              8.36%  2.03194s   1826  1.1128ms  34.844us  847.66ms  cudaLaunchKernel
              5.16%  1.25490s      1  1.25490s  1.25490s  1.25490s  cuCtxDestroy
              4.80%  1.16684s    544  2.1449ms  17.865us  10.378ms  cudaMallocPitch
              1.89%  460.14ms    616  746.98us  20.469us  346.82ms  cudaFree
              1.65%  401.38ms     76  5.2813ms  44.533us  19.211ms  cudaMemcpy2D
              1.45%  352.97ms     51  6.9209ms  18.803us  242.14ms  cudaMalloc
              1.42%  345.25ms      1  345.25ms  345.25ms  345.25ms  cudaFuncGetAttributes
              1.23%  299.95ms      1  299.95ms  299.95ms  299.95ms  cuCtxCreate
              1.03%  251.43ms     20  12.572ms  162.61us  103.74ms  cudaMallocManaged
              0.92%  224.67ms     13  17.283ms  32.553us  65.173ms  cudaMemcpy
              0.56%  135.48ms      1  135.48ms  135.48ms  135.48ms  cudaDeviceReset
...
I hope someone can help me figure out what the problem is!
Upvotes: 3
Views: 1698
Reputation: 2678
I had the same problem and managed to improve the performance of the CUDA-based morphology by some margin. Instead of creating the morphology filter objects inside the loop, I moved their creation outside of the image capture loop, so they are built only once.
The code should then look like this:
int num_frame = 10;
int frame = 0;
cv::Mat img;
cv::cuda::GpuMat img_gpu;
cv::Mat open_kernel = cv::getStructuringElement(cv::MORPH_RECT, cv::Size(11, 11));
cv::Mat close_kernel = cv::getStructuringElement(cv::MORPH_RECT, cv::Size(21, 21));
// Morphology filter object creation outside the loop.
// Note: img_gpu is still empty at this point, so img_gpu.type() should evaluate to CV_8UC1;
// pass the actual type of your frames here if it differs.
cv::Ptr<cv::cuda::Filter> morph_filter_open = cv::cuda::createMorphologyFilter(cv::MORPH_OPEN, img_gpu.type(), open_kernel);
cv::Ptr<cv::cuda::Filter> morph_filter_close = cv::cuda::createMorphologyFilter(cv::MORPH_CLOSE, img_gpu.type(), close_kernel);
while (frame < num_frame) {
    // load image to img
    // ...
    img_gpu.upload(img);
    morph_filter_open->apply(img_gpu, img_gpu);
    morph_filter_close->apply(img_gpu, img_gpu);
    frame++;
}
I couldn't find any way to improve the performance of the CUDA morphology filter beyond this.
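One property that might still be worth testing, although it was not tried here: a rectangular structuring element is separable, so an erosion (or dilation) with a k x k rectangle gives the same result as a 1 x k pass followed by a k x 1 pass, with 2k instead of k*k comparisons per pixel. The sketch below only demonstrates that equivalence on the CPU in plain C++. The Gray struct and both functions are hypothetical helpers, not OpenCV API, and whether the split actually helps the CUDA filter on the Jetson Nano is an assumption that would need measuring.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical row-major 8-bit single-channel image (not an OpenCV type).
struct Gray {
    int rows, cols;
    std::vector<std::uint8_t> px;
    std::uint8_t at(int r, int c) const { return px[r * cols + c]; }
};

// Erosion with a kh x kw rectangle anchored at its center (odd sizes).
// Neighbors outside the image are ignored.
Gray erodeRect(const Gray& src, int kh, int kw) {
    Gray dst{src.rows, src.cols, std::vector<std::uint8_t>(src.px.size())};
    for (int r = 0; r < src.rows; ++r) {
        for (int c = 0; c < src.cols; ++c) {
            std::uint8_t m = 255;
            for (int dr = -kh / 2; dr <= kh / 2; ++dr) {
                for (int dc = -kw / 2; dc <= kw / 2; ++dc) {
                    const int rr = r + dr, cc = c + dc;
                    if (rr >= 0 && rr < src.rows && cc >= 0 && cc < src.cols)
                        m = std::min(m, src.at(rr, cc));
                }
            }
            dst.px[r * src.cols + c] = m;
        }
    }
    return dst;
}

// Same erosion as erodeRect(src, k, k), done as two 1-D passes:
// a horizontal 1 x k pass followed by a vertical k x 1 pass.
Gray erodeSeparable(const Gray& src, int k) {
    return erodeRect(erodeRect(src, 1, k), k, 1);
}
```

Note that the split applies to each erosion and each dilation separately. Opening with a 1 x k kernel followed by opening with a k x 1 kernel is not the same operation as opening with the k x k rectangle, so with OpenCV this would mean separate cv::MORPH_ERODE and cv::MORPH_DILATE filters built from cv::Size(k, 1) and cv::Size(1, k) kernels.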
Upvotes: 0