Reputation: 4884
GPU: GeForce GTX 750
CPU: Intel i5-4440 3.10 GHz
Here is a simple C++ code I'm running.
#include <iostream>
#include "opencv2/highgui/highgui.hpp"
#include "opencv2/gpu/gpu.hpp"

int main(int argc, char** argv) {
    cv::Mat img0 = cv::imread("IMG_0984.jpg", CV_LOAD_IMAGE_GRAYSCALE); // Size 3264 x 2448
    cv::Mat img0Blurred;

    cv::gpu::GpuMat gpuImg0(img0);
    cv::gpu::GpuMat gpuImage0Blurred;

    int64 tickCount;
    for (int i = 0; i < 5; i++)
    {
        tickCount = cv::getTickCount();
        cv::blur(img0, img0Blurred, cv::Size(7, 7));
        std::cout << "CPU Blur " << (cv::getTickCount() - tickCount) / cv::getTickFrequency() << std::endl;

        tickCount = cv::getTickCount();
        cv::gpu::blur(gpuImg0, gpuImage0Blurred, cv::Size(7, 7));
        std::cout << "GPU Blur " << (cv::getTickCount() - tickCount) / cv::getTickFrequency() << std::endl;
    }

    cv::gpu::DeviceInfo deviceInfo;
    std::cout << "Device Info: " << deviceInfo.name() << std::endl;

    std::cin.get();
    return 0;
}
And as a result, I am usually getting something like this:
CPU Blur: 0.01
GPU Blur: 1.7
CPU Blur: 0.009
GPU Blur: 0.012
CPU Blur: 0.009
GPU Blur: 0.013
CPU Blur: 0.01
GPU Blur: 0.012
CPU Blur: 0.009
GPU Blur: 0.013
Device Info: GeForce GTX 750
So the first GPU operation takes time, fine.
But what about the rest of the GPU calls?
Why does the GPU provide no acceleration here? It is a big image, after all (3264 x 2448), and blurring is a task well suited to parallelization, is it not?
Is my CPU that good, or is my GPU that bad? Or is there some kind of communication bottleneck between the two?
Upvotes: 2
Views: 5877
Reputation: 1074
Your first GPU measurement is far from the others; I've experienced the same thing. The first call to an OpenCV GPU kernel (erode/dilate/etc.) takes longer than the ones that follow, presumably because the CUDA context and device code are initialized lazily on first use. In one application, we made a dummy first call to cv::gpu::XX while initializing GPU memory, precisely so that this warm-up cost would not pollute our measurements.
I've also seen that cv::gpu issues a cudaDeviceSynchronize after each call made without a cv::gpu::Stream parameter. This can be slow and make your measurements noisy. On top of that, OpenCV probably allocates a temporary buffer for the kernel it uses to blur the image.
I also don't see gpuImage0Blurred being allocated anywhere in your example. Make sure the destination image is allocated outside the loop; otherwise you are measuring the allocation time for that matrix as well.
Running your application under nvvp (the NVIDIA Visual Profiler) can show you what is really happening and help you remove unnecessary operations.
EDIT:
#include <iostream>
#include "opencv2/highgui/highgui.hpp"
#include "opencv2/gpu/gpu.hpp"

int main(int argc, char** argv) {
    cv::Mat img0 = cv::imread("IMG_0984.jpg", CV_LOAD_IMAGE_GRAYSCALE); // Size 3264 x 2448
    cv::Mat img0Blurred;

    cv::gpu::GpuMat gpuImg0;
    cv::gpu::Stream stream;
    stream.enqueueUpload(img0, gpuImg0);
    stream.waitForCompletion();

    // Allocate the destination matrix outside the loop.
    cv::gpu::GpuMat gpuImage0Blurred(gpuImg0.size(), gpuImg0.type());

    int64 tickCount;
    for (int i = 0; i < 5; i++)
    {
        tickCount = cv::getTickCount();
        cv::blur(img0, img0Blurred, cv::Size(7, 7));
        std::cout << "CPU Blur " << (cv::getTickCount() - tickCount) / cv::getTickFrequency() << std::endl;

        tickCount = cv::getTickCount();
        cv::gpu::blur(gpuImg0, gpuImage0Blurred, cv::Size(7, 7), cv::Point(-1, -1), stream);
        // Ensure the GPU has finished before measuring the time spent.
        stream.waitForCompletion();
        std::cout << "GPU Blur " << (cv::getTickCount() - tickCount) / cv::getTickFrequency() << std::endl;
    }

    std::cin.get();
    return 0;
}
Yes, it turns out waitForCompletion makes all the difference.
I am getting the same values as at the beginning:
CPU Blur: 0.01
GPU Blur: 1.7
CPU Blur: 0.009
GPU Blur: 0.012
CPU Blur: 0.009
GPU Blur: 0.013
CPU Blur: 0.01
GPU Blur: 0.012
CPU Blur: 0.009
GPU Blur: 0.013
Upvotes: 6