Reputation: 1151
So I'm working through OpenMP, trying to research whether a CPU or a GPU will run image blurring faster, slower, or about the same. From what I understand, a GPU should run it a bit faster, because GPUs perform many small operations relatively quickly, whereas a CPU can perform complex operations in reasonable time. Is that correct?
So here is the code I'm using to test it:
// Headers needed by these snippets (PI, Weights, myMin, and myMax are
// defined elsewhere in my code):
#include <cmath>    // ceil, exp, round
#include <cstdio>   // printf
#include <ctime>    // std::clock
#include <iostream>
#include <omp.h>
#include <opencv2/core/core_c.h>  // legacy C API: IplImage, cvGet2D, cvSet2D

IplImage* gaussian_blur_parallel(IplImage* image, double r) {
    IplImage* result = cvCloneImage(image);
    int h = image->height;
    int w = image->width;
    int rs = (int)ceil(r * 2.57); // significant radius
    std::clock_t start;
    start = std::clock();
    #pragma omp parallel for schedule(guided) num_threads(4)
    for (int i = 0; i < h; i++) {
        int current_num_threads = omp_get_num_threads();
        std::cout << "threads " << current_num_threads << std::endl;
        for (int j = 0; j < w; j++) {
            Weights weights = {}; // zero-initialise the accumulators
            for (int iy = i - rs; iy < i + rs + 1; iy++) {
                for (int ix = j - rs; ix < j + rs + 1; ix++) {
                    int x = myMin(w - 1, myMax(0, ix));
                    int y = myMin(h - 1, myMax(0, iy));
                    double dsq = (ix - j) * (ix - j) + (iy - i) * (iy - i);
                    double wght = exp(-dsq / (2 * r * r)) / (PI * 2 * r * r);
                    CvScalar channels = cvGet2D(image, y, x);
                    // accumulate the weighted value for each channel
                    for (int c = 0; c < 3; c++) {
                        weights.value[c] += channels.val[c] * wght;
                        weights.weight[c] += wght;
                    }
                }
            }
            // set the value for each channel in the resulting image.
            CvScalar resultingChannels = cvGet2D(result, i, j);
            for (int c = 0; c < 3; c++) {
                resultingChannels.val[c] = round(weights.value[c] / weights.weight[c]);
            }
            cvSet2D(result, i, j, resultingChannels);
        }
    }
    std::cout << "Time: " << (std::clock() - start) / (double)(CLOCKS_PER_SEC / 1000) << " ms" << std::endl;
    return result;
}
From what I'm seeing in the documentation, anything run inside that pragma should be doing its work on a GPU, correct?
However, if I run it without the pragma (I'm assuming this is the CPU doing the work):
IplImage* gaussian_blur(IplImage* image, double r) {
    IplImage* result = cvCloneImage(image);
    int h = image->height;
    int w = image->width;
    printf("h=%d, w=%d", h, w);
    int rs = (int)ceil(r * 2.57); // significant radius
    std::clock_t start;
    start = std::clock();
    for (int i = 0; i < h; i++) {
        for (int j = 0; j < w; j++) {
            Weights weights = {}; // zero-initialise the accumulators
            for (int iy = i - rs; iy < i + rs + 1; iy++) {
                for (int ix = j - rs; ix < j + rs + 1; ix++) {
                    int x = myMin(w - 1, myMax(0, ix));
                    int y = myMin(h - 1, myMax(0, iy));
                    double dsq = (ix - j) * (ix - j) + (iy - i) * (iy - i);
                    double wght = exp(-dsq / (2 * r * r)) / (PI * 2 * r * r);
                    CvScalar channels = cvGet2D(image, y, x);
                    // accumulate the weighted value for each channel
                    for (int c = 0; c < 3; c++) {
                        weights.value[c] += channels.val[c] * wght;
                        weights.weight[c] += wght;
                    }
                }
            }
            // set the value for each channel in the resulting image.
            CvScalar resultingChannels = cvGet2D(result, i, j);
            for (int c = 0; c < 3; c++) {
                resultingChannels.val[c] = round(weights.value[c] / weights.weight[c]);
            }
            cvSet2D(result, i, j, resultingChannels);
        }
    }
    std::cout << "Time: " << (std::clock() - start) / (double)(CLOCKS_PER_SEC / 1000) << " ms" << std::endl;
    return result;
}
The time for this version is roughly the same as for the supposedly GPU-accelerated one.
So my question really is: is what I'm doing irrelevant, not really making much of a difference on a GPU? Is it that I'm starting the clock in the wrong place? And how can I properly check how many threads are being used in parallel in OpenMP?
Upvotes: 0
Views: 649
Reputation: 74475
First of all, you are using the wrong method to measure time. std::clock measures CPU time, not wall-clock time, and CPU time is accumulated across all threads, so you will never see the measured value decrease when you run multithreaded, unless some kind of superlinear speedup effect comes into play. Use omp_get_wtime() instead.
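The difference is easy to demonstrate with a minimal, self-contained sketch (the loop body is just a stand-in for any parallel work): with N threads, the std::clock figure stays roughly the same or even grows, while the omp_get_wtime figure shrinks by roughly a factor of N.

#include <cmath>
#include <ctime>
#include <iostream>
#include <omp.h>

int main() {
    std::clock_t c0 = std::clock();  // CPU time, summed over all threads
    double w0 = omp_get_wtime();     // wall-clock (elapsed real) time

    double sink = 0.0;
    #pragma omp parallel for reduction(+:sink)
    for (int i = 0; i < 100000000; i++)
        sink += std::sin(i);         // arbitrary work to keep the cores busy

    std::cout << "clock():         "
              << (std::clock() - c0) * 1000.0 / CLOCKS_PER_SEC << " ms\n"
              << "omp_get_wtime(): "
              << (omp_get_wtime() - w0) * 1000.0 << " ms\n"
              << "(sink = " << sink << ")\n"; // print sink so the loop isn't optimised away
}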
Second, the parallel for construct will not execute on an accelerator device unless it is nested within the scope of a target construct:
double start;
start = omp_get_wtime();
#pragma omp target ...
#pragma omp parallel for schedule(guided)
for (int i = 0; i < h; i++) {
    ...
}
std::cout << "Time: " << (omp_get_wtime() - start) * 1000.0 << " ms" << std::endl;
Execution on GPUs and other accelerators with their own memory space requires proper setup of the data environment. This includes adding various map clauses to the target directive, which instruct the compiler to copy certain data onto the device before execution and back from it after the offloaded region has finished. Take a look at the documentation of the map clause, then examine carefully all the variables used in the code and write the corresponding clauses.
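For illustration only, here is a sketch of what that could look like for the loop above. It assumes the pixel data has first been copied out of the IplImage into flat arrays src and dst of h*w*3 doubles each, because a structure containing pointers, such as IplImage, cannot be mapped to the device as-is:

// src and dst are hypothetical flat copies of the input and output pixels
#pragma omp target map(to: src[0:h*w*3]) map(from: dst[0:h*w*3])
#pragma omp parallel for schedule(guided)
for (int i = 0; i < h; i++) {
    // ... the blur computation, reading src[] and writing dst[] ...
}

Scalar variables such as h, w, and r are implicitly firstprivate on the target region in recent OpenMP versions, so they usually need no explicit map clauses.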
Also, pay attention to the fact that all functions called from within the target region must be explicitly marked with the declare target construct; otherwise the compiler won't generate versions of them callable on the device, which will result in errors. That applies to functions like cvGet2D(), cvSet2D(), myMin(), and myMax(). If those are implemented as preprocessor macros instead, it is not necessary to declare them as target functions.
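For the two helper functions, and assuming they are plain functions with the obvious definitions, that would look something like this:

#pragma omp declare target
// hypothetical definitions; what matters here is the enclosing construct
int myMin(int a, int b) { return a < b ? a : b; }
int myMax(int a, int b) { return a > b ? a : b; }
#pragma omp end declare target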
Once all this is in place, you have to pass the right command-line options to the compiler in order to enable generation of target code, and make sure that your compiler supports GPU offloading at all. For example, the Intel compiler supports offloading to Intel Xeon Phi only. GCC should support offloading to Intel Xeon Phi and NVIDIA GPUs in its latest incarnation, but it still seems to have some issues. Your current best option for OpenMP offloading to NVIDIA GPUs is the PGI compiler suite. I have no idea what the situation with offloading to AMD GPUs is.
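For example, GCC enables offload code generation with -fopenmp combined with an -foffload option (e.g., -foffload=nvptx-none for NVIDIA targets), while Clang uses -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda. The exact option names and supported targets vary between compilers and versions, so check your compiler's documentation.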
Upvotes: 0