Reputation: 582
i have this code snippet I came across and I'm trying to use OpenMP to make it run faster than the original version. However, in comparison this seems to be taking about the same amount of time as the older version. Not sure why this multithreading approach is not working to optimize it. Like the timings are still the same. What can I do to make it run even faster?:
void sobel(unsigned char *data_out,
unsigned char *data_in, unsigned height,
unsigned width)
{
/* Sobel matrices for convolution */
int sobelv[3][3] = { {-1, -2, -1}, {0, 0, 0}, {1, 2, 1} };
int sobelh[3][3] = { {-1, 0, 1}, {-2, 0, 2}, {-1, 0, 1} };
unsigned int size, i, j;
int lay;
size = height * width;
#ifdef OPENMP
#pragma omp parallel for collapse(64) shared (data_in,data_out,sobelv, sobelh,size) private (i,j,lay)
#endif
for (lay = 0; lay < 3; lay++) {
for (i = 1; i < height - 1; ++i) {
for (j = 1; j < width - 1; j++) {
int sumh, sumv;
int k = -1, l = -1;
sumh = 0;
sumv = 0;
/* Convolution part */
for ( k = -1; k < 2; k++)
for (l = -1; l < 2; l++) {
sumh =
sumh + sobelh[k + 1][l + 1] *(int) data_in[lay * size + (i + k) * width +(j + l)];
sumv =
sumv + sobelv[k + 1][l +1] * (int) data_in[lay *size +(i +k) *width + (j +l)];
}
int temp = abs(sumh / 8) + abs(sumv / 8);
data_out[lay * size + i * width + j] =
(temp > 255? 255: temp);
}
}
}
}
the main function is simply calling this function like this: sobel(data_out, data_in, header.height, header.width);
any help would be appreciated!! :)
Upvotes: 0
Views: 979
Reputation: 50279
The best optimization you can apply is to vectorize the code. Compilers can often auto-vectorize the code when it is sufficiently simple but this one is too complex for most compilers (including GCC and Clang) to vectorize it.
Manual code vectorization is cumbersome error-prone and often make the code (more) dependent of a specific architecture (eg. x86-64). However, you can help the compiler to generate it for you. To do that, you it better to:
-O3
for GCC/Clang, possibly coupled with -mavx2
if your target platform supports the AVX-2 instruction set, or with -march=native
if your target platform is the one where the program is built;memcpy
calls, restrict
compiler extensions, etc.) [thanks to @Laci].You can check the generated assembly code to see if the code is vectorized or not.
Moreover, using collapse(2)
should enough here to get a good speed-up. collapse(3)
can introduce some unwanted overheads due to the last loop being shared amongst threads. collapse(64)
is not correct (it cannot be bigger than the number of nested loops).
Here is the resulting untested code:
#include <cmath>
void sobel(unsigned char *data_out,
unsigned char *data_in, int height,
int width)
{
const int size = height * width;
#ifdef OPENMP
#pragma omp parallel for collapse(2) shared(data_in,data_out,size)
#endif
for (int lay = 0; lay < 3; lay++)
{
for (int i = 1; i < height - 1; ++i)
{
for (int j = 1; j < width - 1; j++)
{
short a11 = data_in[lay * size + (i-1) * width + (j-1)];
short a12 = data_in[lay * size + (i-1) * width + j];
short a13 = data_in[lay * size + (i-1) * width + (j+1)];
short a21 = data_in[lay * size + i * width + (j-1)];
short a23 = data_in[lay * size + i * width + (j+1)];
short a31 = data_in[lay * size + (i+1) * width + (j-1)];
short a32 = data_in[lay * size + (i+1) * width + j];
short a33 = data_in[lay * size + (i+1) * width + (j+1)];
short sumh = a13 - a11 + (a23 - a21) + (a23 - a21) + a33 - a31;
short sumv = a31 + a32 + a32 + a33 - (a11 + a12 + a12 + a13);
short temp = (abs(sumh) >> 3) + (abs(sumv) >> 3);
data_out[lay * size + i * width + j] = (temp > 255? 255: temp);
}
}
}
}
I expect the code to be several time faster (especially true in sequential) -- typically about 10 times faster with AVX-2 since the processor can work on 16 values at once (despite a bit more work related to SIMD instructions).
Another possible optimization you can do is called register blocking. The idea is to change the loop so that you work on small fixed-size tiles (eg. 2x2 or 4x2 SIMD values). This should reduces the number of L1-cache loads and the number of char-to-short/short-to-char conversions to perform. However, this is hard to help the compiler so it does this optimization correctly on such a code. It is probably better to use SIMD intrinsics if performance is critical and do the register blocking yourself.
Upvotes: 3