Reputation: 128
I'm developing an OpenCV application on the Jetson TK1. I'm using the OpenCV4Tegra package provided by NVIDIA.
dpkg -l output:
ii libopencv4tegra 2.4.10.1 armhf OpenCV4Tegra
ii libopencv4tegra-dev 2.4.10.1 armhf OpenCV4Tegra
ii libopencv4tegra-repo 2.4.10.1 armhf OpenCV4Tegra
I'm trying to get an idea of the speedup the Jetson can provide for my application.
I've tested copying data from the host to the device.
OpenCV code:
cv::Mat src_host = cv::imread("image.png");
cv::gpu::GpuMat src;
src.upload(src_host);
I've placed the upload call in a loop and timed it. It usually averages about 10 ms.
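The timing loop looks roughly like this (the image path and iteration count are placeholders; I time only the upload itself):

```cpp
#include <opencv2/opencv.hpp>
#include <opencv2/gpu/gpu.hpp>
#include <cstdio>

int main()
{
    cv::Mat src_host = cv::imread("image.png");
    cv::gpu::GpuMat src;

    const int iters = 100;
    int64 t0 = cv::getTickCount();
    for (int i = 0; i < iters; ++i)
        src.upload(src_host);
    double ms = (cv::getTickCount() - t0) * 1000.0
                / cv::getTickFrequency() / iters;
    std::printf("average upload: %.3f ms\n", ms);
    return 0;
}
```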
When I try similar Cuda code:
cv::Mat src_host = cv::imread("image.png");
// elemSize() is bytes per pixel including all channels
// (elemSize1() would count only one channel)
int nb_bytes = src_host.total() * src_host.elemSize();
uchar* data;
cudaMalloc(&data, nb_bytes);
cudaMemcpy(data, src_host.data, nb_bytes, cudaMemcpyHostToDevice);
cudaFree(data);
This code averages about 50-100 us.
When I try OpenCV operations like:
cv::gpu::GaussianBlur(src, dst, cv::Size(25, 25), 0);
This also takes an order of magnitude longer than a custom Cuda implementation.
Am I using OpenCV's gpu functions incorrectly? Am I making incorrect assumptions?
Upvotes: 0
Views: 830
Reputation: 1074
If you profile your code with nvvp, you will see that OpenCV calls cudaDeviceSynchronize after each operation it runs on the device.
To avoid these synchronizations, you must use the asynchronous API: create a gpu::Stream and launch your operations on that stream.
Don't forget to synchronize once after all your kernel calls.
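A minimal sketch of what that looks like with the 2.4 gpu module (the image path is a placeholder):

```cpp
#include <opencv2/opencv.hpp>
#include <opencv2/gpu/gpu.hpp>

int main()
{
    cv::Mat src_host = cv::imread("image.png");
    cv::Mat dst_host;

    cv::gpu::Stream stream;
    cv::gpu::GpuMat d_src, d_dst;

    // Queue the upload, the filter, and the download on the same stream;
    // none of these calls blocks the host.
    stream.enqueueUpload(src_host, d_src);
    cv::gpu::GaussianBlur(d_src, d_dst, cv::Size(25, 25), 0, 0,
                          cv::BORDER_DEFAULT, -1, stream);
    stream.enqueueDownload(d_dst, dst_host);

    // One synchronization at the end, instead of one per operation.
    stream.waitForCompletion();
    return 0;
}
```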
Note also that for some operations (erode/dilate/GaussianBlur/...) the first call takes much longer than the following ones. To benchmark correctly, call each of them once during your device initialization so the warm-up cost is paid before you start timing.
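For example, a hypothetical warm-up step at initialization (the dummy image size is arbitrary; its contents don't matter):

```cpp
#include <opencv2/opencv.hpp>
#include <opencv2/gpu/gpu.hpp>

int main()
{
    // First call pays for CUDA context creation and kernel loading,
    // so run the operation once on dummy data before any timing.
    cv::gpu::GpuMat warm_src(64, 64, CV_8UC3), warm_dst;
    cv::gpu::GaussianBlur(warm_src, warm_dst, cv::Size(25, 25), 0);

    // ... subsequent timed calls now measure only the operation itself.
    return 0;
}
```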
Upvotes: 4