jr-be

Reputation: 128

OpenCV on Jetson TK1 is much slower than custom CUDA code

I'm developing an OpenCV application on the Jetson TK1. I'm using the OpenCV4Tegra package provided by NVIDIA.

dpkg -l output:

ii  libopencv4tegra                                       2.4.10.1                                            armhf        OpenCV4Tegra
ii  libopencv4tegra-dev                                   2.4.10.1                                            armhf        OpenCV4Tegra
ii  libopencv4tegra-repo                                  2.4.10.1                                            armhf        OpenCV4Tegra

I'm trying to get an idea of the speedup the Jetson can provide for my application.

I've tested copying data from the host to the device.

OpenCV code:

cv::Mat src_host = cv::imread("image.png");
cv::gpu::GpuMat src;
src.upload(src_host);

I've placed the upload call in a loop and timed it. It averages about 10 ms per call.

When I try similar CUDA code:

cv::Mat src_host = cv::imread("image.png");
int nb_bytes = src_host.rows*src_host.cols*src_host.elemSize1();
uchar* data;
cudaMalloc(&data, nb_bytes);
cudaMemcpy(data, src_host.data, nb_bytes, cudaMemcpyHostToDevice);

This code averages about 50-100 µs.

When I try OpenCV operations like:

cv::gpu::GaussianBlur(src, dst, cv::Size(25, 25), 0);

This also takes an order of magnitude longer than a custom CUDA implementation.

Am I using OpenCV's GPU functions incorrectly? Am I making incorrect assumptions?

Upvotes: 0

Views: 830

Answers (1)

X3liF

Reputation: 1074

If you profile your code with nvvp (the NVIDIA Visual Profiler), you will see that OpenCV calls cudaDeviceSynchronize after each operation you run on the device.

To avoid these synchronizations, you must use OpenCV's asynchronous API: create a gpu::Stream and launch your operations into that stream.

Don't forget to synchronize once after all your kernel calls, for example with the stream's waitForCompletion().

Also note that for some operations (erode/dilate/GaussianBlur/...) the first call takes much longer than subsequent ones. To benchmark your code correctly, call each operation once during device initialization and start timing only after that.
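
The two points above (warm up first, then synchronize only once after all the queued work) can be combined in a small timing harness. This is only a minimal sketch in plain C++11: `mean_time_us`, `op` and `sync` are hypothetical names introduced for illustration and are not part of OpenCV or CUDA.

```cpp
#include <chrono>
#include <cstddef>

// Hypothetical helper (not an OpenCV or CUDA API).
// Runs `op` `warmup` times without timing it, so one-time initialization
// cost is excluded, then times `iters` further runs and returns the mean
// wall-clock time per run in microseconds.
// `sync` is called once after the warm-up and once after the timed loop,
// never inside the loop. `iters` must be greater than zero.
template <typename Op, typename Sync>
double mean_time_us(Op op, Sync sync, std::size_t warmup, std::size_t iters) {
    for (std::size_t i = 0; i < warmup; ++i) {
        op();  // untimed warm-up calls
    }
    sync();  // make sure warm-up work has finished before timing starts

    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < iters; ++i) {
        op();  // only enqueue work here, no per-call synchronization
    }
    sync();  // single synchronization after all timed calls
    const auto stop = std::chrono::steady_clock::now();

    const std::chrono::duration<double, std::micro> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(iters);
}
```

In your case `op` would presumably be a lambda that enqueues the upload or the GaussianBlur into your gpu::Stream, and `sync` a lambda that calls waitForCompletion() on that stream. Those lambdas are left out here because the exact overloads depend on your OpenCV4Tegra version.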

Upvotes: 4
